
Welcome to chapter four of Methods in Medical Informatics! Book indexing consists of collecting significant words and their associated page numbers. A similar process can be applied to online text to improve its organization and speed up text processing. We will explore scripts that demonstrate computational text indexing. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics - Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# 4.1 ZIPF Distribution of a Text File

In almost every segment of life, a small number of items usually accounts for the bulk of the observable activity. This pattern also holds true for the words that compose a text; its mathematical description is known as Zipf's law. You can write a script to illustrate the Zipf distribution for any text.*

> This script uses the d2020.bin file, which contains tens of thousands of MeSH terms. Additional information is available [here](https://datamine.unc.edu/datafiles_yuchenh/)

**Description adapted from pages 53-54 of "Methods in Medical Informatics"*
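
As a rough illustration of the pattern, the sketch below ranks the words of a short made-up sample string (not the MeSH file) by frequency, using `collections.Counter`. The helper name `zipf_table` and the sample text are assumptions introduced here for illustration; they are not part of the book's script.

```python
import re
from collections import Counter

def zipf_table(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r'\b[a-z]{3,15}\b', text.lower())
    counts = Counter(words)
    # most_common() orders the words by descending count
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# Made-up sample text: a few words account for most of the occurrences
sample = ("the cell and the protein and the receptor "
          "the cell and the kinase")
table = zipf_table(sample)
for rank, word, count in table:
    print(rank, word, count)
```

Even in this tiny sample, "the" and "and" account for 8 of the 13 words, while half of the distinct words occur only once.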

In [1]:
import re
import string
word_list = []
freq_list = []
format_list = []
freq = {}
in_text = open('d2020.bin', "r", encoding="utf-8")
in_text_string = in_text.read()
out_text = open("meshzipf.txt", "w")
word_list = re.findall(r'(\b[A-Za-z][a-z]{2,15}\b)', in_text_string)
in_text_string = ""
for item in word_list:
    count = freq.get(item,0)
    freq[item] = count + 1
for key, value in freq.items():
    value = "000000" + str(value)
    value = value[-6:]
    format_list += [value + " " + key]
format_list = sorted(format_list, reverse=True)
print("\n".join(format_list), file=out_text)  # write to meshzipf.txt
out_text.close()
print("\n".join(format_list))  # display in the notebook

045270 the
036645 abcdef
034267 and
026575 abbcdef
017737 was
016454 see
014973 with
013647 under
010274 for
009718 that
008685 abcdefv
008624 Protein
007931 use
007519 The
007461 are
005355 from
004900 not
004744 which
004526 Receptor
004286 Cell
004124 used
004046 Syndrome
004026 specific
003870 Proteins
003837 also
003732 abbcde
003631 indexed
003416 Disease
003336 alpha
003268 Type
003139 search
003095 Agents
003043 Factor
002964 family
002962 beta
002893 coordinate
002805 Receptors
002721 Acid
002532 other
002411 abbbcdef
002378 genus
002368 may
002193 Diseases
002118 Kinase
002056 They
002035 cells
001986 confuse
001981 protein
001961 Health
001906 acid
001898 coord
001889 species
001880 abcdeef
001873 abbcdefv
001857 disease
001773 infection
001764 general
001752 Nerve
001717 but
001704 found
001679 abbcdeef
001589 available
001536 cell
001534 plant
001517 Drug
001516 has
001488 usually
001475 associated
001471 Bi

## Script Algorithm: Zipf Distribution of a Text File

Import the necessary packages and initialize the variables*

In [22]:
import re
import string
word_list = []
freq_list = []
format_list = []
freq = {}

Open the input file for reading and create a new file, meshzipf.txt, which will receive the output of the Zipf distribution.

In [23]:
in_text = open('d2020.bin', "r", encoding="utf-8")
in_text_string = in_text.read()
out_text = open("meshzipf.txt", "w")

Parse the string, matching each occurrence of a letter followed by at least 2, and at most 15, lowercase letters, with the sequence bounded on either side by a word boundary.

In [24]:
word_list = re.findall(r'(\b[A-Za-z][a-z]{2,15}\b)', in_text_string)
in_text_string = ""
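
To see what this pattern does and does not match, here is a small check on a made-up sample string (the sample is an assumption, not text from d2020.bin). Capitalization is preserved, which is why "the" and "The" appear as separate entries in the output; all-uppercase acronyms, words shorter than 3 letters, and words longer than 16 letters are skipped.

```python
import re

pattern = r'\b[A-Za-z][a-z]{2,15}\b'
sample = "The DNA of an E. coli cell encodes Proteins; immunohistochemistry"

matches = re.findall(pattern, sample)
print(matches)
# ['The', 'coli', 'cell', 'encodes', 'Proteins']
```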

Create a dictionary object that will include words (keys) and number of occurrences (values)

In [25]:
for item in word_list:
    count = freq.get(item,0)
    freq[item] = count + 1

After the dictionary object is complete, format the values in the dictionary, as a zero-padded string of uniform length. 

In [26]:
for key, value in freq.items():
    value = "000000" + str(value)
    value = value[-6:]
    format_list += [value + " " + key]
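
The zero-padding matters because the entries are sorted as strings. The sketch below, using three counts taken from the output above, shows that unpadded strings sort in the wrong order, that padded strings sort correctly, and that sorting on the integer counts is an alternative. Using `str.zfill` in place of the slicing idiom is a substitution made here; unlike `value[-6:]`, it would not truncate a count with more than six digits.

```python
freq = {"the": 45270, "for": 10274, "that": 9718}

# Unpadded: plain string comparison puts "9718" ahead of "45270"
unpadded = sorted((str(count) for count in freq.values()), reverse=True)

# Padded to six digits: string order now equals numeric order
padded = sorted((str(count).zfill(6) + " " + word for word, count in freq.items()),
                reverse=True)

# Alternative: sort the (word, count) pairs on the integer count itself
by_count = sorted(freq.items(), key=lambda pair: pair[1], reverse=True)

print(unpadded)
print(padded)
print(by_count)
```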

Sort the formatted entries in descending order of frequency, write them to meshzipf.txt, and print them.

In [27]:
format_list = sorted(format_list, reverse=True)
print("\n".join(format_list), file=out_text)  # write to meshzipf.txt
out_text.close()
print("\n".join(format_list))  # display in the notebook

045270 the
036645 abcdef
034267 and
026575 abbcdef
017737 was
016454 see
014973 with
013647 under
010274 for
009718 that
008685 abcdefv
008624 Protein
007931 use
007519 The
007461 are
005355 from
004900 not
004744 which
004526 Receptor
004286 Cell
004124 used
004046 Syndrome
004026 specific
003870 Proteins
003837 also
003732 abbcde
003631 indexed
003416 Disease
003336 alpha
003268 Type
003139 search
003095 Agents
003043 Factor
002964 family
002962 beta
002893 coordinate
002805 Receptors
002721 Acid
002532 other
002411 abbbcdef
002378 genus
002368 may
002193 Diseases
002118 Kinase
002056 They
002035 cells
001986 confuse
001981 protein
001961 Health
001906 acid
001898 coord
001889 species
001880 abcdeef
001873 abbcdefv
001857 disease
001773 infection
001764 general
001752 Nerve
001717 but
001704 found
001679 abbcdeef
001589 available
001536 cell
001534 plant
001517 Drug
001516 has
001488 usually
001475 associated
001471 Bi

**This section is adapted from section 4.1.1, "Script Algorithm", of page 54 from "Methods in Medical Informatics".*

## Analysis: Zipf Distribution of a Text File

The top entries from the MeSH file are:

`045270 the
036645 abcdef
034267 and
026575 abbcdef
017737 was
016454 see
014973 with
013647 under
010274 for
009718 that`

For these scripts, the entire content of a file is loaded into a string variable. This variable is subsequently parsed into words, and each occurrence of a word is counted. If the file is very large, the script can be modified to read the file line by line, incrementing the word/frequency tally for the words contained in each line. At the top of the Zipf list are the high-frequency words, such as “the”, “and”, and “was”, that serve as connectors for lower-frequency, highly specific terms. Also included at the top of the Zipf list are frequently recurring letter sequences peculiar to the file; in this case, “abcdef” and “abbcdef”. Zipf distributions have many uses in informatics projects, including the preparation of “stopword” lists.*

**This section is adapted from section 4.1.2, "Analysis", of page 56 in "Methods in Medical Informatics".*
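
The line-by-line modification mentioned in the analysis might look like the following sketch. The function name `count_words_by_line` and the small temporary file standing in for a large corpus are assumptions made for illustration; `collections.Counter` is used in place of the plain dictionary in the script above.

```python
import os
import re
import tempfile
from collections import Counter

WORD = re.compile(r'\b[A-Za-z][a-z]{2,15}\b')

def count_words_by_line(path):
    """Tally word frequencies while holding only one line in memory."""
    freq = Counter()
    with open(path, "r", encoding="utf-8") as in_text:
        for line in in_text:
            freq.update(WORD.findall(line))
    return freq

# Demonstration on a small temporary file (a stand-in for a very large file)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("the cell and the protein\nthe receptor and the cell\n")
    tmp_path = tmp.name

freq = count_words_by_line(tmp_path)
os.remove(tmp_path)

for word, count in freq.most_common(3):
    print(str(count).zfill(6), word)
```

Because the pattern never matches across a line break, the tally should be the same as when the whole file is read at once.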

# 4.2 Preparing a Concordance

A concordance is a special type of index, listing every location of every word in the text. Concordances can be used to support very fast proximity searches (finding the locations of words in proximity to other words) and phrase searches (finding sequences of words located in an ordered sequence somewhere in the text). Using only a concordance, it is a simple matter to computationally recreate the entire text. Preparing a concordance is quite simple.*

> This script uses two text files, [STOP.TXT](./K11946_Files/STOP.TXT) and [TITLES.TXT](./K11946_Files/TITLES.TXT). STOP.TXT contains a list of stopwords. TITLES.TXT contains a list of 100 titles of journal articles. More information is available [here](https://datamine.unc.edu/datafiles_yuchenh/)

**Description adapted from page 57 of "Methods in Medical Informatics".*
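
The claim that a text can be recreated from its concordance can be sketched as follows. This illustration assumes a concordance that records the position of every word; the script in this section records sentence numbers instead, so it is not the same data structure. The function names and the sample words are hypothetical.

```python
def build_concordance(words):
    """Map each word to the list of positions where it occurs."""
    concordance = {}
    for position, word in enumerate(words):
        concordance.setdefault(word, []).append(position)
    return concordance

def rebuild_text(concordance):
    """Invert the concordance: place every word back at its positions."""
    total = sum(len(positions) for positions in concordance.values())
    slots = [None] * total
    for word, positions in concordance.items():
        for position in positions:
            slots[position] = word
    return " ".join(slots)

words = "the tumor cell line and the normal cell line".split()
concordance = build_concordance(words)
print(concordance)
print(rebuild_text(concordance))
```

Punctuation and capitalization are not recovered unless they are also stored in the concordance.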

In [28]:
import re
import string
sentence_list = []
word_list = []
word_dict = {}
format_list = []
count = 0
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open('./K11946_Files/TITLES.TXT', "r")
in_text_string = in_text.read()
in_text.close()
in_text_string = in_text_string.replace("\n"," ")
in_text_string = re.sub(r" +", " ", in_text_string)  # str.replace would treat " +" literally
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])',in_text_string)
for sentence in sentence_list:
    count = count + 1
    sentence = sentence.lower()
    word_list = re.findall(r'(\b[a-z]{3,15}\b)', sentence)
    for word in word_list:
        if word in word_dict:
            word_dict[word] = word_dict[word] + ',' + str(count)
        else:
            word_dict[word] = str(count)
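keylist = word_dict.keys()
# Keys are visited in insertion order (order of first occurrence);
# loop over sorted(keylist) instead for alphabetical order
for key in keylist:
    print(key, word_dict[key])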

carcinoid 1
tumor 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
the 1,6,7,8,11,12,14,16,17,19,22,23,35,39,41,41,44,47,49,51,55,58,59,65,70,71,72,76,78,78,90,95,96,98
common 1
bile 1
duct 1,66
rare 1,39
complication 1
von 1,54
hippel 1
lindau 1
syndrome 1
establishment 2,13
and 2,3,6,10,10,13,13,16,16,18,18,19,20,20,21,22,27,33,35,35,36,38,40,42,44,45,46,46,47,47,49,52,52,55,56,59,63,64,64,67,68,69,70,71,71,74,77,78,79,82,83,84,85,85,89,91,95,96,97,97,98,98,99
new 2
cell 2,3,8,12,13,16,20,30,33,40,42,43,43,47,48,52,52,61,64,65,65,68,69,84,85,93,96
line 2,3
derived 2,42,49,59
from 2,13,71,89,93
human 2,5,12,13,35,37,44,56,61,62,98
colorectal 2,59,81
laterally 2
spreading 2
vivo 3,10,27,82
anti 3,34,36,48,51,58,82
effect 3,51
hybrid 3,37
vac

## Script Algorithm: Preparing a Concordance

Import the necessary packages*

In [29]:
import re
import string
sentence_list = []
word_list = []
word_dict = {}
format_list = []
count = 0

Read the list of stopwords, then read the entire contents of the TITLES.TXT file into a string variable

In [30]:
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open('./K11946_Files/TITLES.TXT', "r")
in_text_string = in_text.read()
in_text.close()

Split the file into sentences

In [31]:
in_text_string = in_text_string.replace("\n"," ")
in_text_string = re.sub(r" +", " ", in_text_string)  # str.replace would treat " +" literally
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])',in_text_string)
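
The behavior of the sentence-splitting pattern can be checked on a made-up string (an assumption, not text from TITLES.TXT). The split occurs only where a period, exclamation point, or question mark is followed by one or more spaces and then an uppercase letter; the delimiting punctuation is discarded. The last two lines show why collapsing runs of spaces needs `re.sub` rather than `str.replace`, which treats `" +"` as a literal string.

```python
import re

text = ("Carcinoid tumor of the bile duct. A rare case? "
        "Yes! e.g. this stays joined. Vol. 2 is next.")
sentences = re.split(r'[\.\!\?] +(?=[A-Z])', text)
for sentence in sentences:
    print(sentence)

literal = "a   b".replace(" +", " ")    # no literal " +" present: unchanged
collapsed = re.sub(r" +", " ", "a   b")  # regular expression: runs collapsed
print(repr(literal), repr(collapsed))
```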

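Parse each sentence into a list of words, and add the sentence number of each word to the dictionary object that contains the encountered words and their locations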

In [32]:
for sentence in sentence_list:
    count = count + 1
    sentence = sentence.lower()
    word_list = re.findall(r'(\b[a-z]{3,15}\b)', sentence)
    for word in word_list:
        if word in word_dict:
            word_dict[word] = word_dict[word] + ',' + str(count)
        else:
            word_dict[word] = str(count)

Collect the words (the keys of the dictionary object, which now holds every encountered word and its locations)

In [33]:
keylist = word_dict.keys()

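Print out each word in the dictionary object with its locations. Python dictionaries preserve insertion order, so the words appear in order of first occurrence; iterate over sorted(keylist) to list them alphabetically instead.

In [34]:
for key in keylist:
    print(key, word_dict[key])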

carcinoid 1
tumor 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
the 1,6,7,8,11,12,14,16,17,19,22,23,35,39,41,41,44,47,49,51,55,58,59,65,70,71,72,76,78,78,90,95,96,98
common 1
bile 1
duct 1,66
rare 1,39
complication 1
von 1,54
hippel 1
lindau 1
syndrome 1
establishment 2,13
and 2,3,6,10,10,13,13,16,16,18,18,19,20,20,21,22,27,33,35,35,36,38,40,42,44,45,46,46,47,47,49,52,52,55,56,59,63,64,64,67,68,69,70,71,71,74,77,78,79,82,83,84,85,85,89,91,95,96,97,97,98,98,99
new 2
cell 2,3,8,12,13,16,20,30,33,40,42,43,43,47,48,52,52,61,64,65,65,68,69,84,85,93,96
line 2,3
derived 2,42,49,59
from 2,13,71,89,93
human 2,5,12,13,35,37,44,56,61,62,98
colorectal 2,59,81
laterally 2
spreading 2
vivo 3,10,27,82
anti 3,34,36,48,51,58,82
effect 3,51
hybrid 3,37
vac

**This section is adapted from section 4.2.1, "Script Algorithm", of page 57 from "Methods in Medical Informatics".*
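
The script above reads STOP.TXT into `stop_list` but never uses it, which is why words such as "the" and "and" appear in the output. The following sketch shows one possible variant that skips stopwords, records each sentence number only once per word, and prints the words alphabetically. The two titles and the three-word stopword set are made-up stand-ins for TITLES.TXT and STOP.TXT.

```python
import re

titles = ["Carcinoid tumor of the common bile duct",
          "The tumor cell line and the tumor stroma"]
stopwords = {"the", "and", "of"}   # stand-in for the contents of STOP.TXT

concordance = {}
for number, title in enumerate(titles, start=1):
    for word in re.findall(r'\b[a-z]{3,15}\b', title.lower()):
        if word in stopwords:
            continue
        locations = concordance.setdefault(word, [])
        if number not in locations:      # record each sentence once per word
            locations.append(number)

# sorted() returns the keys in alphabetical order
for word in sorted(concordance):
    print(word, ",".join(str(n) for n in concordance[word]))
```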

## Analysis: Preparing a Concordance

The sample text consisted of 100 parsed sentences. Here are the first few lines of the output.*

`carcinoid 1
tumor 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
the 1,6,7,8,11,12,14,16,17,19,22,23,35,39,41,41,44,47,49,51,55,58,59,65,70,71,72,76,78,78,90,95,96,98
common 1
bile 1
duct 1,66
rare 1,39
complication 1`

**This section is adapted from section 4.2.2, "Analysis", of pages 59-60 from "Methods in Medical Informatics".*
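
As a minimal sketch of how such a concordance supports proximity searches, the function below finds the sentences in which several words occur together by intersecting their location lists. The function name `sentences_with_all` is hypothetical; the three dictionary entries are copied from the output above and use the same comma-separated string format as `word_dict`.

```python
def sentences_with_all(word_dict, words):
    """Return sorted sentence numbers in which every given word occurs."""
    location_sets = []
    for word in words:
        locations = word_dict.get(word, "")
        location_sets.append({int(n) for n in locations.split(",") if n})
    if not location_sets:
        return []
    return sorted(set.intersection(*location_sets))

# Entries copied from the concordance output above
word_dict = {"line": "2,3",
             "derived": "2,42,49,59",
             "colorectal": "2,59,81"}

print(sentences_with_all(word_dict, ["derived", "colorectal"]))
print(sentences_with_all(word_dict, ["line", "derived", "colorectal"]))
```

Here proximity means "in the same sentence"; a concordance that stored word positions would allow narrower windows.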

# 4.3 Extracting Phrases

All text is composed of words and phrases that represent specific concepts and that are connected together into a sequence of meaningful statements. One way to extract useful concepts is to remove common words, or "stopwords". This script demonstrates phrase extraction through stopword removal.*

> This script uses the text files [STOP.TXT](./K11946_Files/STOP.TXT) and [cancer_gene_titles.txt](./K11946_Files/cancer_gene_titles.txt). STOP.TXT contains a list of common stopwords. cancer_gene_titles.txt contains a list of titles of cancer-related journal articles extracted from a PubMed query. More information is available [here](https://datamine.unc.edu/datafiles_yuchenh/)

**Description adapted from page 60 of "Methods in Medical Informatics".*
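
The idea can be tried on a single made-up title before running the full script. This sketch is a variant, not the book's method: it builds one alternation pattern for all stopwords and compiles it once, rather than looping over the stopwords for every line. The five-word stopword list and the function name `extract_phrases` are assumptions introduced here.

```python
import re

stopwords = ["of", "the", "in", "and", "with"]   # stand-in for STOP.TXT
# One alternation pattern for all stopwords, compiled once;
# re.escape guards against stopwords containing regex metacharacters
splitter = re.compile(r'\s*\b(?:' + "|".join(map(re.escape, stopwords)) + r')\b\s*')

def extract_phrases(line):
    """Split a lowercased line at stopwords and return the non-empty pieces."""
    pieces = splitter.split(line.lower().strip())
    return [piece for piece in pieces if piece]

title = "Regulation of the p53 gene in colorectal cancer cells"
phrases = extract_phrases(title)
print(phrases)
# ['regulation', 'p53 gene', 'colorectal cancer cells']
```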

In [58]:
import re, string
item_list = []
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open("./K11946_Files/cancer_gene_titles.txt", "r")
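for line in in_text:
    for stopword in stop_list:
        stopword = re.sub(r'\n', '', stopword)
        # Replace each stopword (and surrounding spaces) with a newline,
        # so the text between stopwords is split into candidate phrases
        line = re.sub(r' *\b' + re.escape(stopword) + r'\b *', '\n', line)
    item_list.extend(line.split("\n"))
in_text.close()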
item_list = sorted(set(item_list))
out_text = open('phrases.txt', "w")
for item in item_list:
    print(item)
    print(item, file=out_text)
out_text.close()


1
1 25 dihydroxyvitamin d3 regulation
1 25 oh 2d3
1 3 butadiene data integration opportunities
1 4 benzoquinone
1 4 dichlorobenzene
1 d microfluidic beads array
1 lymphocytes
1 molecular target drug discovery
1 naphthol
1 nitropyrene
1 subtypes
10 10 dioxides
10 23 dnazyme inhibit
10 23 dnazymes
10 genes
10 lung cancer cell lines
10 year update
100 kb resolution
100 years
104 patients
11 year
115 cases
1193 cases
11beta hydroxysteroide dehydrogenases
11q amplification
11q13
11q13 gene amplification
12 15 lipoxygenases
12 cases
120 min
120 patients
124 expression
124i mibg
13 27 weeks old human male fetus gonads
131i
137 cesium
13q14
13q14 deletion
14
14 3 3
14 3 3 proteins
14 3 3 proteins integrate e2f activity
14 3 3 sigma
14 3 3gamma
14 3 3sigma
14 3 3sigma controls mitotic translation
14 3 3zeta
145
14th ctos annual meeting
15 lipoxygenase 1
15 lipoxygenase 2
15 pgdh
15 year old boy
154 french nf2 mutation carriers
15q24 25 1
16 bp duplication polymorphisms
16 tumor related genes
1

answers
antagonist
antagonistic forces
antagonistic pleiotropy
antagonistic roles
antagonists
antagonize bcl 6 function
antagonize rna silencing
antagonized
anterior foregut muscle development
anterior gradient 2
anterior pituitary
anterior pituitary function
anthracenes
anthracyclines
anthracyclins
anthrax toxin entry
anthrax toxins
anthropoid specific segmental duplication
anthropoids
anti adaptor protein irap
anti adipogenic regulation underlies hepatic stellate cell transdifferentiation
anti aging
anti aging drug today
anti aging pill
anti alpha enolase antibodies
anti alphagal dependent complement mediated cytotoxicity
anti angiogenesis
anti angiogenesis effect
anti angiogenesis treatment
anti angiogenic action
anti angiogenic activity
anti angiogenic cancer therapy based
anti angiogenic gene therapy
anti angiogenic targets
anti angiogenic therapies
anti angiogenic therapy
anti angiopoietin 2 treatment
anti apoptosis effect
anti apoptotic effect
anti apoptotic effects
anti apoptot

blue eyed dogs
blueprint
blurred vision
blv infection profiles
blys blys receptors
bmal1
bmi 1
bmi 1 expression predicts prognosis
bmi 1 promotes ewing sarcoma tumorigenicity independent
bmi1
bmi1 cooperate
bmp
bmp 4 induced epidermal commitment
bmp signaling
bmp suppresses pten expression
bmp2 induced osteoblast lineage commitment program
bmp4 expression
bmp4 regulates pancreatic progenitor cell expansion
bmp7
bms 275183 induced gene expression patterns
bnip3
bnip3l
bnp
body
body mass
body size
bola drb3 2 gene
bolts
bone
bone cancer
bone cell biology
bone cell function
bone cells
bone express p63
bone formation
bone forming metastases
bone induce
bone marrow
bone marrow derived mesenchymal stem cells
bone marrow derived stromal cells express lineage related messenger rna species
bone marrow failure
bone marrow favor tumor cell growth
bone marrow involvement
bone marrow involvements
bone marrow jak2v617f allele burden
bone marrow mesenchymal stem cells
bone marrow samples
bone marrow 

cd8 foxp3 regulatory t cells mediate immunosuppression
cd8 hogging
cd8 lck transgene
cd8 regulatory t cells
cd8 t cell help
cd8 t cell tolerance
cd8 t cells
cd8 t cells armed
cd8 t cells reactive
cd8 t cells sharpens immunodominance
cd80
cd80 cd86 tgf beta1
cd8alphabeta
cd90 cancer stem cells
cd96
cd99
cd99 acts
cdb 4022
cdc18 cdc6 activates
cdc2
cdc2 gene expression
cdc20
cdc20 directed apc c
cdc25 phosphatases
cdc25 phosphatases expression
cdc25 phosphatases structure specificity
cdc25a phosphatase
cdc25b expression
cdc25b functions
cdc25c
cdc42
cdc6
cdc6 knockdown inhibits human neuroblastoma cell proliferation
cdc7 kinase mediates claspin phosphorylation
cdh1
cdh1 apc
cdh1 germline missense variants
cdh1 polymorphisms tooth agenesis
cdk inhibitor p18ink4c
cdk inhibitors cell cycle regulators
cdk inhibitors potential targets
cdk1
cdk11 p58
cdk2
cdk4
cdk4 cooperatively control
cdk4 inhibits
cdk6
cdk8
cdk9 cyclin t1 complex
cdk9 cyclin t2 complex
cdk9 phosphorylates p53
cdkn1c p57kip2

colorectal cancers
colorectal cancers differ based
colorectal carcinogenesis
colorectal carcinogenesis 1 hereditary predisposition
colorectal carcinogenesis road maps
colorectal carcinoma
colorectal carcinoma assessed
colorectal carcinoma cells
colorectal carcinoma molecular markers
colorectal carcinoma news
colorectal carcinoma tissues
colorectal disease
colorectal liver metastases
colorectal liver metastasis
colorectal micrornaome
colorectal neoplasia
colorectal neoplasia implications
colorectal patients
colorectal polyposes
colorectal polyposis
colorectal polyps
colorectal serrated adenocarcinoma
colorectal serrated adenoma diagnostic criteria
colorectal serrated lesions
colorectal tumor
colorectal tumor growth
colorectal tumorigenesis
colorectal tumors
colorectum
colours
columnar cell lesions
columnar epithelia
com 1 p8 acts
combat cancer
combination
combination targeted therapy
combination therapies
combinatorial effects
combinatorial patterns
combinatorial regulation
combinatoria

developmental origin
developmental perspective
developmental potential
developmental program
developmental signalling pathways
developmental status
developments
dexamethasone
dexamethasone impairs hypoxia inducible factor 1 function
dexamethasone inhibits
dexd h box rna helicase lgp2 manifests disparate antiviral responses
dextran sulfate sodium
dfna5
dgem
dha induced apoptosis
dhplc
dhplc screening strategy
dia1 controls melanoma proliferation
diabetes
diabetes injury
diabetes patients
diabetic erectile dysfunction
diabetic mice
diabetic nephropathy
diabetic rats
diabetic retinopathy
diadenosines
diagnose blv genome
diagnosing hereditary breast cancer syndromes
diagnosis
diagnosis epidemiology
diagnosis genetics
diagnosis prognosis
diagnosis staging
diagnostic
diagnostic algorithm
diagnostic biomarker
diagnostic importance
diagnostic testing
diagnostic utility
diagnostic value
diagnostics
diagnostics amid debate gene based cancer test approved
dialectic role
dialysis
diamond blackfan 

endolymphatic sac tumors surgical management
endolyn
endometrial cancer
endometrial cancer appearance
endometrial cancer cells
endometrial cancer management
endometrial cancer new light
endometrial cancer patients
endometrial cancer risk
endometrial carcinoma
endometrial carcinoma cell
endometrial carcinoma hope
endometrial carcinoma pathology
endometrial function
endometrial mucosa
endometrial receptivity
endometrial stromal sarcoma
endometrial stromal sarcoma cell line
endometrioid adenocarcinoma
endometrioid endometrial carcinogenesis
endometriosis
endometrium
endonuclease g
endonuclease g selectively kills polyploid cells
endoplasmic reticulum
endoplasmic reticulum ca 2 store
endoplasmic reticulum mechanisms
endoplasmic reticulum photodamage
endoplasmic reticulum stress
endoproteolysis
endosalpingeal development
endoscopy
endosomal cargo
endosonographers
endostatin gene therapy inhibits tumor growth
endostatin therapy reveals
endothelial barrier function
endothelial cell function
e

folate metabolism polymorphisms
folate metabolism polymorphisms influence risk
folate metabolizing enzymes
folate peg baculovirus
folate receptor alpha
folate receptor expression
folate receptors
folate related cancer pathologies
folate related genes
folate status
folate tethered emulsion
folic acid
folic acid fortification
follicle stimulating hormone receptor polymorphism
follicular dendritic cell sarcoma
follicular lymphoid hyperplasia
follicular lymphoma
follicular lymphoma frequently originates
follicular lymphoma international prognostic index
follicular lymphoma today treatments
follicular origin
follicular patterned lesion
follicular t cell markers
follicular thyroid cancer cells
follicular thyroid carcinoma
follicular thyroid tumors
follicular variant papillary thyroid carcinoma
folliculotropic mycosis fungoides
follistatin gene
follistatin promoter
follow
food
food flavoring agent maltol
food patterns
footprinting screen
footprints
forbidden alphabeta tcr
forced expression
fo

glucose regulated protein 78
glucose regulates
glucose sensor
glucose uptake sensitizes cells
glucosinolates
glucuronidation
glur1 expression
glut7
glutamate induced damage
glutamine prevents dmba induced squamous cell cancer
glutathione
glutathione depletion
glutathione metabolism
glutathione peroxidase 1
glutathione peroxidase 2 gpx2 promoter
glutathione s transferase gstt1
glutathione s transferase omega gene
glutathione s transferase p1
glutathione transferase p
gluten ataxia recognize
gly400val
glycerol kinase overexpression
glyco gene expression
glycodelin
glycodelin gene expression
glycodelin reduces breast cancer xenograft growth
glycogen storage disease types
glycogen synthase kinase 3
glycogen synthase kinase 3 beta
glycogen synthase kinase 3 gsk3 inflammation diseases
glycogen synthase kinase 3beta
glycolipids
glycolysis
glycolytic mechanism regulating
glycomedb integration
glycoprotein expression
glycoprotein iib
glycosylation
glyoxalase
glypican 3
glypican 5
gm csf
gm csf 

hpaa
hphf1
hpp1 mediated tumor suppression requires activation
hpttg1 securin
hpv
hpv 16
hpv 18 dna
hpv 18 e7 conjugates
hpv detection
hpv induced carcinogenesis
hpv induced cervical carcinogenesis
hpv infected head
hpv infection
hpv negative vulvar carcinoma
hpv status
hpv type 16 li antigen expressing tumor model
hpv16 early region transcription
hpv16 gene copy number quantification
hpv16 genome
hpv16 variants
hpv58 e6 gene
hr gp100 protein
hrad9
hras
hras exhibit different leukemogenic potentials
hras mutation causes costello syndrome
hras1 gene
hrasls2 gene
hrk gene
hrk inactivation associated
hrpt2 gene
hrpt2 gene alterations
hrsl3 tumor suppressor function
hsd17b1 genetic variants
hsf1
hsmr3a
hsnf5 ini1 mutation analysis
hsp27
hsp27 stress response
hsp60 expression
hsp60 regulation
hsp70
hsp70 expression
hsp70 hom genetic variant
hsp70 inducible hnis ires egfp reporter imaging response
hsp70 interact
hsp701a induces k562 cells apoptosis
hsp70b regulation
hsp72
hsp72 protects
hsp9

immunoregulatory t cells role
immunostimulatory activity
immunosuppression promotes reovirus therapy
immunosuppressive drug
immunosuppressive properties
immunotherapeutic effect
immunotherapeutic target
immunotherapy
immunotherapy starts
immunotherapy targeting ebv expressing lymphoproliferative diseases
impact
impact etiology
impact mechanisms
impacts
impair liver regeneration
impaired
impaired adipogenesis caused
impaired angiogenesis
impaired control
impaired differentiation
impaired expression
impaired glomerular maturation
impaired gonadal development
impaired hepatic regeneration
impaired insulin secretion
impaired microrna processing enhances cellular transformation
impaired phagocytic mechanism
impaired steroidogenesis
impaired synaptic plasticity
impaired trna nuclear export links dna damage
impairment
impairs vessel reactivity
imperfect oligonucleotides
implementation
implementing
implicated
implication
implications
importance
important
important consequences
important disord

leprosy
leptin
leptin action
leptin augments proliferation
leptin effect
leptin ghrelin
leptin induces inflammation related genes
leptin receptor
leptin receptor expression
leptin receptor ob r
leptin signaling
leptin stat3 signaling
leptomycin b sensitive
lesch nyhan disease
lesion
lesions
lesser evil prophylactic mastectomy
lesser extent noxa
lessons
lessons learned
lethal activity
letrozole
leucine rich repeat kinase 2 associates
leukaemia
leukaemia cell lines
leukaemia cells
leukaemia impact
leukaemia lineage specification caused
leukaemia stem cell development
leukaemias
leukaemic cell line
leukaemic transformation
leukaemogenesis
leukaemogenic mechanism
leukemia
leukemia cell
leukemia cell lines
leukemia cells
leukemia cells hl 60
leukemia genetics
leukemia group b gastrointestinal cancer committee
leukemia induction
leukemia initiating cells
leukemia lymphoma associated gene fusions
leukemia patients
leukemia patients reveals new abo variant alleles
leukemia protein
leukemia ste

metastatic brain cancer
metastatic breast cancer
metastatic breast cancer cells
metastatic breast cancer survival
metastatic cancer fatal attraction
metastatic cancers
metastatic cell origin
metastatic colon cancer stem cells
metastatic colorectal cancer
metastatic disease
metastatic dormancy
metastatic esophageal carcinoma masquerading
metastatic human bladder cancer
metastatic human gastric cancer
metastatic liver tumors
metastatic lung adenocarcinoma
metastatic medullary thyroid carcinoma
metastatic melanoma
metastatic melanoma cells
metastatic non small cell lung cancer
metastatic osteosarcoma gene expression differs
metastatic pancreatic tumors
metastatic phenotype
metastatic potential
metastatic process
metastatic progression
metastatic properties
metastatic prostate cancer cells
metastatic renal cancer
metastatic renal cell carcinoma
metastatic spread
metastatic spread mechanisms
metastatic testicular teratoma
metastatic tracks
metastatic tumor antigen 3
metastatic tumor cell de

multifork replication
multifunctional adhesion receptor
multifunctional nanoparticulate polyelectrolyte complexes
multifunctional ns1 protein
multifunctional protein ctcf
multifunctional protein sparc
multifunctional roles
multifunctional transcription factor yy1
multifunctional zinc finger protein
multigene predictors
multigenic approach
multilevel inference
multimer mediated mayhem
multimodal imaging
multimodal tumor inhibitor
multimodality approach
multipathway disease
multiple
multiple aberrations
multiple adh genes
multiple alternative splicing markers
multiple anti survivin hammerhead ribozymes
multiple aspects
multiple b
multiple battles fought
multiple bladder tumors
multiple cancer cell lines
multiple cancers
multiple cell cycle checkpoints
multiple chromosomal loci
multiple cistronic vectors fmdv 2a
multiple colon carcinoma
multiple colorectal neoplasms
multiple components
multiple cutaneous metastases
multiple cysts
multiple dna repair pathways
multiple drug resistance induc

nonintegrating lentiviral vectors
noninvasive assessment
noninvasive bronchioloalveolar carcinoma
noninvasive cell tracking
noninvasive imaging
noninvasive monitoring
noninvasive papillary urothelial neoplasms
noninvasive prenatal diagnosis
nonlinear tests
nonlinear transformation models
nonmelanoma skin cancer
nonmelanoma skin cancers
nonparametric fdr estimation revisited
nonparametric linkage analysis
nonparametric pathway based regression models
nonreceptor protein tyrosine phosphatases
nonrigid registration
nonselective packaging
nonsense
nonsense codons trigger
nonsense mediated decay pathway
nonsense mediated decay rna surveillance pathway
nonsense mediated mrna decay efficiency
nonsense mutation
nonsense mutation 193c t
nonsense mutation e1978x
nonsense mutations
nonsense mutations causing human genetic disease
nonsmall cell lung cancer
nonsmokers
nonsurgical management
nonsyndromic mental retardation
nonsynonymous polymorphisms
nontoxic silencing
nontumoral adenohypophyses
non

p53 dependent pathways
p53 dependent protein phosphorylation
p53 dependent senescence
p53 determines multidrug sensitivity
p53 directs focused genomic responses
p53 dna binding domain regulates apoptosis induction
p53 downregulates expression
p53 downstream target genes
p53 dying
p53 enhances ascorbyl stearate induced g2 m arrest
p53 enters
p53 exon 4 mutations
p53 expression
p53 family
p53 family isoforms
p53 family member p73
p53 family members
p53 family members regulate
p53 family prospect
p53 family proteins
p53 friends acquaintances
p53 function
p53 function leads
p53 gene
p53 gene alterations identified
p53 gene family transactivate pkcdelta
p53 gene mutation
p53 gene mutations
p53 gene regulatory networks
p53 gene replacement therapy
p53 genes
p53 genetic polymorphism
p53 genotype
p53 genotypes
p53 guardian
p53 heterozygous irradiated mice
p53 hgf c met stat3 signal
p53 immunostaining
p53 inactivated mammary epithelial cells
p53 inactivating oncogene wip1 ppm1d
p53 independent 

postpartum pony mares
postprandial lipemia
postradiation sensitization
posttranscription regulation
posttranscriptional orchestration
posttranslational mechanisms
posttranslational phosphorylation
posttranslational regulation
posttransplant denervated liver
posttransplant lymphoproliferative disorders
potassium channel antagonists
potassium channel blockers
potassium channel expression
potassium channels
pote family proteins
potent activator
potent antiglioma effect
potent antitumor effects
potent cytotoxic photoactivated platinum complex
potent gene silencing
potent inhibitor
potent non viral gene delivery
potent oncogenes
potent oxidizing
potent p53 independent tumor suppressor activity
potent regulator
potent selective nur77 modulators
potential
potential anticancer agents
potential application
potential applications
potential biological function
potential biological target
potential biomarker
potential biomarkers
potential cancer therapeutic target
potential carrier
potential chemo

rare entity
rare entity report
rare event
rare expression
rare form
rare human nucleotide polymorphisms
rare lung diseases
rare model
rare mutation
rare occurrence
rare pik3ca hotspot mutations
rare sequence variants
rare tumor
rare variants
rarely mutated
rars modulate aimp1 emap ii secretion
ras
ras activation
ras activation regulate recq helicase gene expression
ras components
ras dependent carbon metabolism
ras effector gene rassf2
ras erk signaling
ras flt3
ras function
ras gene
ras gene family
ras induced oncogenic transformation
ras induced senescence
ras induces chromosome instability
ras mapk pathway
ras mapk signaling
ras mapk signaling cascade
ras mediated epigenetic silencing
ras mediated tumorigenesis
ras mutated cells
ras mutation promotes p53 activation
ras mutations
ras oncogene
ras oncogene mutations
ras oncogenes
ras oncogenes split personalities
ras paradox
ras pathway activation
ras proteins nitrosylation
ras proteins paradigms
ras raf mek erk
ras regulation
ras rel

secretoglobin 3a2
secretor genotypes
secretory leukocyte peptidase inhibitor
secretory protein
secretory protein splunc1
secrets
security guard
sediments
seeded bayesian networks constructing genetic networks
seeding
seek
seeking completeness
segment
segment number
segmental aneuploid
segmentation
segregation analysis
seizure 6
sel1l
sel1l expression
seldi tof ms
selected areas
selected gene amplification
selected messenger rnas
selected polymorphisms
selected types
selecting adult stem cells
selecting antitumor therapy
selecting highly affine
selecting normalization genes
selection
selective
selective activation
selective activity
selective cancer germline gene expression
selective cellular screening assay
selective chromosome amplification
selective control
selective cytotoxic t lymphocyte targeting
selective estrogen receptor modulators
selective estrogen receptor modulators serms
selective gene delivery
selective gene induction
selective induction
selective inhibition
selective inh

streptococcus pneumoniae
streptococcus pyogenes
streptococcus sanguinis sortase
stress
stress affects uterine receptivity
stress atf6alpha
stress damage response
stress induced genes
stress induced mutation
stress induced thymic atrophy
stress induction
stress phenotype
stress protein response
stress regulated switch
stress resistant tumors
stress response
stress response protein
stress signaling
stress signals
stress specific changes
stress substrate modulates carcinogenic pathways
stressed
stressed marrow foxos stem tumor growth
stressful situation
stressing
stroke
stroke upregulates tnfalpha transport
stroma
stroma sensitivity
stromal anti apoptotic androgen receptor target gene c flip
stromal effects
stromal epithelial interaction
stromal gene expression predicts clinical outcome
stromal gene signature associated
stromal gene signatures
stromal induction
stromal parenchymal interactions
stromelysin gene expression
strong association
strong cancer specific proapoptotic effect
strong

transcriptional control
transcriptional corepressor ctbp
transcriptional cycle
transcriptional deregulation
transcriptional dysregulation
transcriptional effects
transcriptional expression
transcriptional factor sp1
transcriptional features
transcriptional inhibitors p53
transcriptional integration
transcriptional interference
transcriptional level
transcriptional network controlling pluripotency
transcriptional networks inferred
transcriptional origin
transcriptional processing
transcriptional profile
transcriptional profiles
transcriptional profiling
transcriptional program mediating entry
transcriptional programs regulated
transcriptional regulation
transcriptional regulator
transcriptional regulators
transcriptional regulatory mechanism
transcriptional regulatory network
transcriptional repression
transcriptional repressor
transcriptional repressor hey1
transcriptional repressor rest
transcriptional role
transcriptional stability
transcriptional switch
transcriptional target
transc

vivo analysis
vivo angiogenesis imaging
vivo applications
vivo bioluminescence imaging
vivo biomarkers
vivo challenges
vivo conditions
vivo differentiation
vivo dynamics
vivo electroporation
vivo expression
vivo expression set
vivo functions
vivo gene expression
vivo gene expression signature
vivo gene transfer
vivo gene transfer activity
vivo guided angiogenesis
vivo identification
vivo imaging
vivo immunization
vivo induced genes
vivo loss
vivo luminescent imaging
vivo melanoma
vivo models
vivo modulation
vivo molecular
vivo mutagenic effect
vivo mutation data
vivo neurodegeneration
vivo optical imaging
vivo pinpoint drug delivery
vivo processes
vivo protein dna interactions
vivo radioresponse
vivo rate
vivo reprogramming
vivo restoration
vivo sampling
vivo selection
vivo sirna delivery
vivo tumor cell targeting
vivo validation
vivo veritas
vkorc1 polymorphisms
vntr polymorphism
vocal fold injury
von hippel lindau disease
von hippel lindau disease urologic considerations
von hippel l

## Script Algorithm: Extracting Phrases

Import the necessary packages and create an empty list to hold the extracted phrases.

In [59]:
import re, string
item_list = []

Open the STOP.TXT file, which contains a list of common stopwords, one per line. Read the lines into a list.

In [60]:
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
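Note that `readlines()` keeps the trailing newline on each entry, which is why the main loop strips it later. As a minimal sketch of an alternative, the stopwords could be cleaned once at load time. The helper name `load_stopwords` and the small temporary file standing in for STOP.TXT are assumptions made for illustration; they are not part of the original script.

```python
import os
import tempfile

def load_stopwords(path):
    """Read one stopword per line, dropping newlines and blank lines."""
    with open(path, "r") as stopfile:
        return [line.strip() for line in stopfile if line.strip()]

# A small temporary file stands in for STOP.TXT (hypothetical contents).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("the\nof\n\nand\n")
    demo_path = tmp.name

demo_stop_list = load_stopwords(demo_path)
os.remove(demo_path)
print(demo_stop_list)
```

Using a `with` block closes the file automatically, and filtering blank lines avoids building an empty search pattern further on.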

Open the input file, cancer_gene_titles.txt, and initialize a line counter.

In [61]:
in_text = open("./K11946_Files/cancer_gene_titles.txt", "r")
count = 0

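Parse through the lines of the text. Substitute a newline character for every occurrence of a stopword in the line, then split the line on newlines to obtain the phrases. Finally, remove duplicate phrases, sort the list alphabetically, and open the output file.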

In [None]:
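for line in in_text:
    count = count + 1  # number of lines processed
    for stopword in stop_list:
        stopword = stopword.strip()
        if not stopword:
            continue  # a blank entry would match at every word boundary
        # Replace the stopword and any surrounding spaces with a newline
        line = re.sub(r' *\b' + re.escape(stopword) + r'\b *', '\n', line)
    item_list.extend(line.split("\n"))
in_text.close()
# Remove duplicates and sort alphabetically
item_list = sorted(set(item_list))
out_text = open('phrases.txt', "w")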
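To make the substitution step concrete, here is a minimal sketch that applies the same regular expression to a single line. The function name `split_on_stopwords`, the sample title, and the three-word stop list are assumptions chosen for illustration; the real script reads both from files.

```python
import re

def split_on_stopwords(line, stop_words):
    """Replace each stopword (and surrounding spaces) with a newline, then split."""
    for stopword in stop_words:
        line = re.sub(r' *\b' + re.escape(stopword) + r'\b *', '\n', line)
    # Drop the empty strings left where stopwords were adjacent or at line ends.
    return [phrase for phrase in line.split('\n') if phrase]

demo_title = "role of ras oncogenes in the ras mapk pathway"  # hypothetical title
demo_stop_words = ["of", "in", "the"]                         # hypothetical stop list
demo_phrases = split_on_stopwords(demo_title, demo_stop_words)
print(demo_phrases)
# → ['role', 'ras oncogenes', 'ras mapk pathway']
```

The `\b` word boundaries keep the pattern from matching inside longer words, so the "in" within "oncogenes" is left alone. Two adjacent stopwords ("in the") produce an empty string between their newlines, which the sketch filters out.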

Print each phrase from the alphabetically sorted list and write it to phrases.txt.

In [42]:
for item in item_list:
    print(item)
    print(item, file=out_text)
out_text.close()

1
1 25 dihydroxyvitamin d3 regulation
1 25 oh 2d3
1 3 butadiene data integration opportunities
1 4 benzoquinone
1 4 dichlorobenzene
1 d microfluidic beads array
1 lymphocytes
1 molecular target drug discovery
1 naphthol

adhere
adherent invasive escherichia coli
adhesion
adhesion molecule l1 cd171 promotes melanoma progression
adhesion molecules
adhesion programme
adhesion receptor signaling
adhesion receptors mediate efficient non viral gene delivery
adhfe1
adipocyte
adipocyte death adipose tissue re

answers
antagonist
antagonistic forces
antagonistic pleiotropy
antagonistic roles
antagonists
antagonize bcl 6 function
antagonize rna silencing
antagonized
anterior foregut muscle development
anterior gradient 2
anterio

attenuated dna damage repair
attenuated familial adenomatosis polyposis
attenuated familial adenomatous polyposis
attenuated heat shock response
attenuates
attenuation
attorney
attractin
attractive target
attributes
atypical

blame
blastic mantle cell lymphoma
blastic natural killer cell leukaemia
blastoid variant mantle cell lymphoma
blasts
blimp 1
blimp1 regulates cell growth
blind alley
block
blockade
blockage
blocked autophagy sensitizes

calpain 1
calpain mediated androgen receptor breakdown
calpain mediated cleavage
calpain small 1 modulates akt foxo3a signaling
calpains
calponin h1 expression
calreticulin
calretinin
calu 1 lung cancer cell line
calvaria
cam kinase ii

cd54
cd55
cd55 expression
cd56 expression
cd56 immunophenotype
cd59
cd68
cd70
cd8
cd8 foxp3 regulatory t cells mediate immunosuppression
cd8 hogging
cd8 lck transgene

choline metabolism
choline transport
cholinesterases
chondrocyte hypertrophy
chondrocytic gene expression
chondrogenesis leads
chondrogenic cells
chondroid chordoid dilemma resolved
chondrosarcoma
chondrosarcoma gene profile implications
chordin

colorectal cancer cells express functional cell surface bound tgfbeta
colorectal cancer detection
colorectal cancer epidemiology mechanisms
colorectal cancer genetics
colorectal cancer liver metastasis
colorectal cancer morphogenesis
colorectal cancer new pieces
colorectal cancer new targets
colorectal cancer oncogene
colorectal cancer pathogenesis

curcumin inhibits wt1 gene expression
curcumin modulate
cure
cure gene therapy
cure human disease
curious case
current approaches
current challenges
current controversies
current dendrimer applications
current development

development regeneration
development symposium
developmental abnormalities
developmental apoptosis
developmental aspect
developmental biology
developmental biology implications
developmental defects
developmental dependence
developmental diseases
developmental disorders

dpl dna peptide lipid complex
dpp iv cd26
dpyd 2a mutation
dpyd genotyping
dqa1
dqb1 allele typing
dqb1 association
dr haifan lin interviewed
dr tian xu interviewed
dr5 receptor mediates anoikis
dram
drash

endoglin cd105
endoglin cd105 expression
endoglin haploinsufficient mice
endolymphatic sac tumors surgical management
endolyn
endometrial cancer
endometrial cancer appearance
endometrial cancer cells
endometrial cancer management
endometrial cancer new light
endometrial cancer patient

evolutionary origins
evolutionary plasticity
evolutionary selection pressure
evolving concept
evolving concepts
evolving functions
evolving gene therapy approaches
evolving role
evolving science
evolving strategies
evolving targets

fluorescence based method
fluorescence microscopy imaging
fluorescence probe
fluorescent
fluorescent multiplex dgge screening test
fluorescent proteins
fluorochromes
fluoropyrimidine chemotherapy
fluoropyrimidine sensitivity
fluoxetine
flushing pheochromocytoma

gene expression pattern scanner
gene expression patterns
gene expression patterns associated
gene expression profile
gene expression profile analysis
gene expression profile class prediction
gene expression profile related
gene expression profiles
gene expression profiles associated
gene expression profiles predict secondary leukaemia risk

global gene expression profiling
global genome damage score predictive
global genomic instability
global impact
global issues
global mapping
global metabolic effects
global pathway crosstalk network
global protein expression analysis
global proteome profiling
global public health prob

hepatitis b virus replication causes oxidative stress
hepatitis b virus rta181t surface truncation mutant
hepatitis b virus x gene expression
hepatitis c
hepatitis c based
hepatitis c identifying patients
hepatitis c viral infection
hepatitis c virus
hepatitis c virus binding
hepatitis c virus core protein

host genome surveillance
host immune
host immune response
host immunity
host immunogenetics
host microrna regulatory network
host nuclear factor kappab activation potentiates lung cancer metastasis
host related factors
host response
host target sequence
host tissue irradiation

human p53 binding sites cell cycle
human p53 regulated genes
human pancreatic adenocarcinoma cells
human pancreatic cancer
human pancreatic cancer cell line
human pancreatic cancer cell lines
human pancreatic cancer cells
human pancreatic cancer cells induced
human pancreatic carcinoma
human pancreatic carcinoma cells

immune escape genes
immune escape mechanisms
immune evaluation
immune evasion strategies
immune function
immune functions
immune gene expression
immune mediated inflammatory diseases
immune pathogenesis
immune prognostic factors
immune reaction

interleukin 12 deficient mice
interleukin 15
interleukin 15 increases hepatic regenerative activity
interleukin 17
interleukin 17 gene expression
interleukin 18
interleukin 18 gene
interleukin 18 gene promoter polymorphisms
interleukin 1b polymorphisms
interleukin 1beta
interleukin 1b

l deficient mouse brain lysosomes
l h gray
l isoaspartate
l monocytogenes
l myc gene polymorphism
l myc polymorphism
l pk gene
l pneumophila
l type amino acid transporter 1 expressed
l type ca2 channels
l1 cam
l1 mobile

lymphoblastoid cell lines
lymphocyte activation
lymphocyte development
lymphocyte effector molecule perforin
lymphocyte homeostasis
lymphocyte homing imprinting
lymphocyte predominant hodgkin lymphoma
lymphocytes
lymphocytes depends
lymphocytic pleural effusion
lymphoid cell transform

melanoma expressing
melanoma gene expression
melanoma genesis
melanoma hope
melanoma initiating cells
melanoma invasion
melanoma invasion current knowledge
melanoma lack
melanoma lost
melanoma malignant phenotype
melanoma metastases

mitogen activated protein kinase scaffolding
mitogen activated protein kinases
mitogenic signaling
mitogens
mitosis
mitosis independent survivin gene expression
mitosis springtime
mitosis tracking
mitotic arrest
mitotic cdc2 kinase
mitotic checkpoint gene sil

mucosal immunity induced
mucosal nod2 expression
mucosal protection
mucositis research new insights
muir torre syndrome
muir torre syndrome diagnostic
multi class cgh data
multi dimensional genomic data
multi drug resistance
multi drug resistant genes
multi drug tolerance

negative regulator
negative regulator mir 26a
negative regulators
negative role
negative selection
negative thinking
negative transcriptional element
negatively regulate type
negatively regulated
negatively regulates ap 1 activity
neil1 dna glycosylase

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> non small cell lung cancer nsclc
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> non small cell lung cancer patients
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> non small cell lung cancer smokers
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> non small cell lung cancer tissues
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> non small cell lung cancer transcriptome microarray
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> non small cell lung carcinoma
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> non small cell lung carcinoma cell lines
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> non small cell lung carcinoma nsclc
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> non smokers
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> non solid oncogenes
<_io.TextIOWrapper n

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> obesity epidemic pharmacological challenges
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> obesity induced insulin resistance
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> obesity lessons
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> obesity related diseases
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> obesity related mammary carcinogenesis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> obesity related traits
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> obesity relevant
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> obesity risk
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> obestatin levels
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> observational microarray
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> observations
<_io.TextIOWrapper

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p38 map kinase targeted cell permeable peptide
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p38 mapk
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p38 mapk stress pathway
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p38alpha
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p38alpha map kinase
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p38mapk delta controls c myb degradation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p400
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p450 1a1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p450c17 expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p52
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p53
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> p53 aberration
<_io.TextIOWra

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> peptide repertoire
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> peptides
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> peptides bound
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> peptidomimetic conjugated self assembled nanoparticles
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> peptidomimetic sirna transfection reagent
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> peptidylarginine deiminase 4
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> peptidylprolyl isomerase pin1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> per2
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> percutaneous radiofrequency thermal ablation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> perfect dna molecules
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> perfect match


<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> post transcriptional gene regulation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> post transcriptional processing
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> post transcriptional regulation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> post transcriptional roles
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> post translational modifications
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> post translational modifications regulate
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> post transplant lymphoproliferative disorders
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> posterior corneal dystrophy
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> posterior mediastinum
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> posterior uveal melanoma
<_io.TextIOWrapper name='p

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> prostaglandin e2 induces
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> prostaglandin e2 inhibits tumor necrosis factor alpha rna
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> prostaglandin e2 regulates
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> prostaglandin transporters
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> prostaglandins
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> prostanoid receptor ep1 expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> prostasomes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> prostate
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> prostate adenocarcinoma
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> prostate biology
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> prostate bone metastases
<_io.TextIO

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> ras rtn3
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> ras transformation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> ras transformation requires metabolic control
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> ras transformed fibrosarcoma cells
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> ras unplugged negative feedback
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> rasa1 causes capillary malformation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> rasgrf2
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> rash2 mice
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> rash2 mice produced
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> rasl11b knock
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> rassf family
<_io.TextIOWrapper name='phrases.txt' mode

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> retinoblastoma pathways
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> retinoblastoma patients
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> retinoblastoma predisposition
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> retinoblastoma progression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> retinoblastoma protein
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> retinoblastoma protein regulates pericentric heterochromatin
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> retinoblastoma proteins
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> retinoblastoma regulatory pathway
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> retinoblastoma related cell cycle regulator p107
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> retinoblastoma tumor formation
<_io.TextIOWrapper name

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> semenogelin
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> semi allogeneic vaccine
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> semi allogeneic vaccines
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> semi supervised discovery
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> seminal vesicle
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> seminoma cell line tcam 2
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> seminoma human testis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> senescence
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> senescence associated beta galactosidase
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> senescence associated heterochromatin foci
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> senescence breaking
<_io.TextIOWrapper 

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> solid tumor microenvironment
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> solid tumor stem cells
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> solid tumors
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> solid tumors eml4 alk fusion genes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> solid tumors pinpointing
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> solitary fibrous tumor
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> solitary juxtapapillary capillary retinal angioma
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> solitary lung nodules
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> soluble
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> soluble basigin ligand
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> soluble nkg2d ligands prevalence r

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> suppressing wnt signaling
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> suppression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> suppressive effects
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> suppressive mechanisms
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> suppressor
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> suppressors
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> sural nerve
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> surface modified lpd nanoparticles
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> surface molecules
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> surfactant protein d protects
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> surfactant proteins
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> s

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> th1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> th1 cytokines
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> th1 mediated inflammation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> th17
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> th2 cell identity
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> th2 cytokine gene polymorphisms
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> th2 lymphocytes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> thai patients
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> thai population
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> thais
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> thalidomide
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> thapsigargin resistance
<_io.TextIOWrapper name='phrases.

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> treating advanced clear cell renal carcinoma
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> treating advanced prostate cancer
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> treating brca deficient tumors
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> treating hepatitis b viral infection
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> treating hepatocellular carcinoma
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> treating metastatic breast cancer
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> treating myeloma bone disease
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> treating triple negative breast cancer
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> treatment
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> treatment approaches
<_io.TextIOWrapper name='phrases.tx

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> unprecedented g quadruplex scaffold
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> unraveling
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> unraveling estrogen action
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> unraveling human cleft lip
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> unraveling mysteries
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> unregulated smooth muscle myosin
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> unrelated
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> unrelated hematopoietic cell transplantation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> unscheduled overexpression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> unseen
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> unselected hht patients
<_io.TextIOWrappe

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt survival guide
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt1 overexpression promotes tumor progression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt11 supports self renewal
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt13 isoforms
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt2
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt3a
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt4
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt4 pathway
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt5a
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt5a crip1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt5a gene
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='UTF-8'> wnt9a mrna levels increases cellular proliferation
<_

**This section is adapted from section 4.3.1, "Script Algorithm", of page 61 from "Methods in Medical Informatics".*

## Analysis: Extracting Phrases

The output is an alphabetized file of the phrases that might appear in a book's index. We used a file consisting of titles from a PubMed search. This file, cancer_gene_titles.txt, is about 1.1 MB in length, roughly the size of a typical book. We needed only about a dozen lines of code and a few seconds of execution time to create our list of index terms.*

**This section is adapted from section 4.3.2, "Analysis", of page 63 from "Methods in Medical Informatics".*

# 4.4 Preparing an Index

An index is a list of the important words or phrases contained in a book, along with the locations where each of those words and phrases can be found. This is different from a concordance because an index does not contain every word found in the text, and it contains selected phrases in addition to individual words. Software can be used to create indexes. However, remember that a useful index is more selective than a simple record of the location of every word and phrase.*

> This script uses the text files [STOP.TXT](./K11946_Files/STOP.TXT) and [TEXT.TXT](./K11946_Files/TEXT.TXT). STOP.TXT contains a list of common stopwords. TEXT.TXT contains a sample journal article. More information [here](https://datamine.unc.edu/datafiles_yuchenh/)


**Description adapted from page 63-64 of "Methods in Medical Informatics".*
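
To make the contrast with an index concrete, here is a minimal sketch of a concordance, which records every word rather than selected phrases. The function name `concordance` and the sample text are illustrative assumptions, not part of the book's script.

```python
import re
from collections import defaultdict

def concordance(text):
    """Map every lowercase word to the sorted sentence numbers where it occurs."""
    # Same sentence-boundary pattern used elsewhere in this chapter
    sentences = re.split(r"[.!?] +(?=[A-Z])", text)
    entries = defaultdict(set)
    for number, sentence in enumerate(sentences, start=1):
        for word in re.findall(r"[a-z]+", sentence.lower()):
            entries[word].add(number)
    return {word: sorted(numbers) for word, numbers in sorted(entries.items())}

sample = "The index lists selected phrases. A concordance lists every word."
for word, numbers in concordance(sample).items():
    print(word, ",".join(str(n) for n in numbers))
```

Every word, including stopwords such as "the" and "a", gets an entry here. An index would drop those and keep phrases instead.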

In [43]:
import re
import string
item_list = []
item_dictionary = {}
place_string = ""
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open('./K11946_Files/TEXT.TXT', 'r')
in_text_string = in_text.read()
in_text_string = in_text_string.replace("\n"," ")
in_text_string = re.sub(r" +", " ", in_text_string)  # collapse runs of spaces; str.replace would search for a literal " +"
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])',in_text_string)
# Build a translation table that maps every character outside string.printable to a space
badascii = set(in_text_string) - set(string.printable)
table = str.maketrans({ch: " " for ch in badascii})
count = 0
for item in sentence_list:
    count = count + 1
    count_string = str(count)
    item = item.lower()
    item = re.sub(r'\'s', "", item)
    item = item.translate(table)
    for stopword in stop_list:
        stopword = stopword.rstrip()
        item = re.sub(r' *\b' + stopword + r'\b *', '\n', item)
    item_list = item.split("\n")
    for phrase in item_list:
        phrasematch = re.match(r'^[0-9]', phrase)
        if (phrasematch):
            continue
        if phrase in item_dictionary:
            item_dictionary[phrase] = item_dictionary[phrase] + ',' + count_string
        else:
            item_dictionary[phrase] = count_string
keylist = item_dictionary.keys()
keylist = sorted(keylist)
for key in keylist:
    print(key, item_dictionary[key])

 2,4,4,4,4,6,6,6,6,6,6,7,7,7,7,10,10,11,11,11,11,11,11,11,11,12,12,12,12,12,12,13,15,16,16,17,18,18,18,18,19,19,19,19,19,19,19,20,20,20,20,22,22,24,25,25,26,26,26,26,26,26,27,28,28,28,28,29,29,30,30,30,31,31,31,32,32,32,33,33,33,34,34,34,34,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,37,37,37,37,37,38,39,39,39,40,41,42,42,43,44,44,44,44,45,45,45,45,45,45,45,45,45,45,45,45,45,46,47,47,47,47,48,48,48,48,48,48,49,49,49,49,49,50,50,50,50,50,50,50,50,50,50,51,51,51
(6) fatal (abbreviations 22
(umls), extracting 163666 abbreviation/expansion pairs.3 35
, liu et al3 studied 35
- 18
: ap = anterior-posterior; l = left 43
abbreviation classes 11
abbreviation list contained 12097 terms; 5772 abbreviations 15
abbreviations 2,6,7,18,26,30,32,32,33,34,35,35,42,44,48,50,50
abbreviations classed 22
abbreviations ended 20
abbreviations fell 9
abbreviations reveals 34
abbreviations, 48
abbreviations.4 35
accurate algorithms 32
accurate lists 32
acronyms 39,40
adjectives 42
algorithmic approac

In [17]:
with open("./K11946_Files/STOP.TXT","r") as stopfile:
    stop_list = stopfile.readlines()
    
 
stop_list = [item.strip() for item in stop_list]

with open("./K11946_Files/TEXT.TXT","r" ) as in_text:
    in_text_string = in_text.read()
 

in_text_string = in_text_string.replace("\n", " ")
in_text_string = re.sub(r"\s+", " ", in_text_string)

sentence_list = re.split(r"[\.\?\!](?!\d)\s*(?=[A-Z])", in_text_string)
 
# Step 1: create an empty dictionary to hold the phrases and their sentence numbers
phrase_dict = {}

# Step 2: iterate over the sentences, split them into phrases, and store them in an array
for i, sentence in enumerate(sentence_list):
    phrases = [word for word in sentence.split() if word.lower() not in stop_list]
    
    # Step 3: iterate over the phrases, add them to the dictionary, and update their sentence numbers
    for phrase in phrases:
        if phrase not in phrase_dict:
            phrase_dict[phrase] = str(i)
        else:
            phrase_dict[phrase] += ',' + str(i)

# print the resulting phrase dictionary
#print(phrase_dict)


keylist = list(phrase_dict.keys())
 
keylist.sort()

for key in keylist:
    print(key, phrase_dict[key])


(1) 21
(2) 21
(3) 21
(4) 21
(5) 21
(6) 21
(UMLS), 34
(abbreviations 21
12000 3,36,46
12097 14
163666 34
2 37
5772 14
6 21
6325 15
8599 16
= 42,42
AP 42
Abbreviations 1,8
Acronyms 38
American 18
Automatic 31
British 18
CABG, 40
Classes 11
Collecting 23,26
Conclusion 22
Context 0
Design 7
Efforts 2
Examples 42
Expanding 25
L 42
Language 12,16,34
Liu 34,35
Measurements 13
Medical 12,16,34,37
Objective 4
Perl 12
PubMed 33
Recently, 34
Results 20
Strangely, 34
Unified 12,16,34
abbreviation 10,14
abbreviation/expansion 15,34
abbreviations 2,5,6,10,11,14,16,17,18,19,21,23,23,24,25,26,27,28,29,31,31,31,32,33,33,34,34,34,34,35,37,41,43,46,47,48,49,49
abbreviations, 3,36,47
abbreviations.4 34
accurate 31,31
acronyms 37,39
adjectives 41
age, 30
al3 34,35
algorithm, 21
algorithmic 9,11,27,47
algorithmically 34
algorithms 23,31,50
ambiguity 25,35
ambiguous 35
amenable 9
annulare 30
anterior-posterior; 42
appearance 44
approach 23
approaches 9,11,47
artery 40
article, 46
assign 27
assist 10
attentio

## Script Algorithm: Preparing an Index

Create an array containing stopwords. You can use any stopword list you prefer.
In this script, we use STOP.TXT available [here](./K11946_Files/STOP.TXT) 

In [44]:
import re
import string
item_list = []
item_dictionary = {}
place_string = ""
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()

Open a file to be indexed. You can use any file, but in this text, we use text.txt, available at http://www.julesberman.info/book/text.txt

In [45]:
in_text = open('./K11946_Files/TEXT.TXT', 'r')
in_text_string = in_text.read()

Replace each newline with a space and collapse runs of spaces. Stripping non-ASCII characters (not necessary if you are using a
plain-text file) is handled later, inside the sentence loop.

In [46]:
in_text_string = in_text_string.replace("\n"," ")
in_text_string = re.sub(r" +", " ", in_text_string)
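
For Python 3 strings, non-ASCII characters can also be replaced without a translation table. The sketch below is one possible approach, not the book's code, and `strip_non_ascii` is a hypothetical helper name.

```python
def strip_non_ascii(text, replacement=" "):
    """Replace every character outside the 7-bit ASCII range with `replacement`."""
    return "".join(ch if ord(ch) < 128 else replacement for ch in text)

# \u00ef and \u00e9 are accented letters outside the ASCII range
print(strip_non_ascii("na\u00efve caf\u00e9"))
```

Replacing with a space rather than deleting keeps neighbouring words from being fused together when the dropped character was acting as a separator.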

Split the text into sentences and put the consecutive sentences into an array.

In [47]:
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])',in_text_string)
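
The behaviour of this split pattern is easiest to see on a short sample. The sketch below applies the same regular expression to an invented string; `split_sentences` and the sample text are assumptions made for illustration.

```python
import re

# A period, exclamation mark, or question mark, then spaces, then an uppercase letter
SENTENCE_BREAK = re.compile(r"[.!?] +(?=[A-Z])")

def split_sentences(text):
    """Split text at punctuation that is followed by spaces and a capital letter."""
    return SENTENCE_BREAK.split(text)

sample = "Abbreviations are common. Are they ambiguous? Dr. Smith thinks so."
for number, sentence in enumerate(split_sentences(sample), start=1):
    print(number, sentence)
```

The pattern consumes the punctuation and the spaces but leaves the capital letter in place because of the lookahead. It also splits after "Dr.", so abbreviations followed by a capitalized word are a known limitation of this simple rule.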

Create a dictionary object, which will hold phrases as keys and a comma-separated list of numbers, representing the sentences in which the phrases appear, as the values.

For each sentence in the array of consecutive sentences, split the sentence wherever a stopword appears, and put the resulting phrases into an array.

For each array of phrases, parse through the array, assigning each phrase to a dictionary key and appending the number of the sentence in which the phrase occurs to the comma-separated list of sentence numbers that serves as the value for the key (phrase).*

In [48]:
# Build a translation table that maps every character outside string.printable to a space
badascii = set(in_text_string) - set(string.printable)
table = str.maketrans({ch: " " for ch in badascii})
count = 0
for item in sentence_list:
    count = count + 1
    count_string = str(count)
    item = item.lower()
    item = re.sub(r'\'s', "", item)
    item = item.translate(table)
    for stopword in stop_list:
        stopword = stopword.rstrip()
        item = re.sub(r' *\b' + stopword + r'\b *', '\n', item)
    item_list = item.split("\n")
    for phrase in item_list:
        phrasematch = re.match(r'^[0-9]', phrase)
        if (phrasematch):
            continue
        if phrase in item_dictionary:
            item_dictionary[phrase] = item_dictionary[phrase] + ',' + count_string
        else:
            item_dictionary[phrase] = count_string
keylist = item_dictionary.keys()
keylist = sorted(keylist)
for key in keylist:
    print(key, item_dictionary[key])

 2,4,4,4,4,6,6,6,6,6,6,7,7,7,7,10,10,11,11,11,11,11,11,11,11,12,12,12,12,12,12,13,15,16,16,17,18,18,18,18,19,19,19,19,19,19,19,20,20,20,20,22,22,24,25,25,26,26,26,26,26,26,27,28,28,28,28,29,29,30,30,30,31,31,31,32,32,32,33,33,33,34,34,34,34,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,37,37,37,37,37,38,39,39,39,40,41,42,42,43,44,44,44,44,45,45,45,45,45,45,45,45,45,45,45,45,45,46,47,47,47,47,48,48,48,48,48,48,49,49,49,49,49,50,50,50,50,50,50,50,50,50,50,51,51,51
(6) fatal (abbreviations 22
(umls), extracting 163666 abbreviation/expansion pairs.3 35
, liu et al3 studied 35
- 18
: ap = anterior-posterior; l = left 43
abbreviation classes 11
abbreviation list contained 12097 terms; 5772 abbreviations 15
abbreviations 2,6,7,18,26,30,32,32,33,34,35,35,42,44,48,50,50
abbreviations classed 22
abbreviations ended 20
abbreviations fell 9
abbreviations reveals 34
abbreviations, 48
abbreviations.4 35
accurate algorithms 32
accurate lists 32
acronyms 39,40
adjectives 42
algorithmic approac

**This section is adapted from section 4.4.1, "Script Algorithm", of page 65 from "Methods in Medical Informatics".*
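
The steps above can also be condensed into a single function. This is a minimal sketch under stated assumptions, not the book's implementation: `build_index`, the short stopword list, and the sample text are all invented for illustration, and the stopword list is assumed to be non-empty.

```python
import re
from collections import defaultdict

def build_index(text, stopwords):
    """Return {phrase: [sentence numbers]} for stopword-delimited phrases."""
    # One alternation of all stopwords, escaped so they are matched literally
    stop_pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in stopwords) + r")\b"
    )
    index = defaultdict(list)
    sentences = re.split(r"[.!?] +(?=[A-Z])", text)
    for number, sentence in enumerate(sentences, start=1):
        sentence = sentence.lower().rstrip(".!?")
        for phrase in stop_pattern.split(sentence):
            phrase = phrase.strip(" ,;:")
            # Skip empty fragments and phrases that begin with a digit
            if phrase and not phrase[0].isdigit():
                index[phrase].append(number)
    return dict(sorted(index.items()))

stopwords = ["the", "of", "is", "a", "in", "for", "and"]
text = ("Adjuvant chemotherapy is the standard of care. "
        "The adjustment of dose is common in adjuvant chemotherapy.")
for phrase, numbers in build_index(text, stopwords).items():
    print(phrase, ",".join(map(str, numbers)))
```

Compiling the stopwords into one pattern avoids a separate pass over the sentence for each stopword, and skipping empty fragments keeps a blank key out of the index.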

## Analysis: Preparing an Index

An example of the kind of output produced by the script is shown below:

`adjustment 7,9
adjuvant chemotherapy 83
adjuvant imrt 23
`

The numbers represent the sentence numbers in which each phrase occurs. Automated indexing invariably produces a product that a human indexer can improve. The strength of automatic indexing shows when the texts are very long, because indexing long texts by hand is impractical. A flawed computer-generated index is usually better than no index at all.*

**This section is adapted from section 4.4.2, "Analysis", of page 68 from "Methods in Medical Informatics".*

# 4.5 Comparing Texts Using Similarity Scores

When you have extracted all of the phrases occurring in a text, you have created something akin to the signature of the text. We can then determine whether two different texts are similar by comparing their signatures. Similarity scores are very useful in medical science. We can use similarity scores to establish the relatedness of objects (e.g., DNA sequences), to find trends and outliers in population data, to provide "best-fit" search results, and to classify groups of items. This script demonstrates calculating the similarity between two documents using Pearson correlation.*

> This script uses the text files [STOP.TXT](./K11946_Files/STOP.TXT), [paradise.txt](./K11946_Files/paradise.txt), and [treasure.txt](./K11946_Files/treasure.txt). STOP.TXT contains a list of common stopwords. paradise.txt contains the epic poem **Paradise Lost** in text format. treasure.txt contains the novel **Treasure Island** in text format. More information [here](https://datamine.unc.edu/datafiles_yuchenh/)


**This section is adapted from page 69 of "Methods in Medical Informatics".*
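
The Pearson score is computed from running sums over the paired counts: r = (Σxy − ΣxΣy/n) / sqrt((Σx² − (Σx)²/n)(Σy² − (Σy)²/n)). The sketch below isolates that calculation in a small function. The name `pearson` and the sample vectors are illustrative assumptions, and the guard against constant input is an addition that the script itself does not make.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    covariance = sum_xy - sum_x * sum_y / n
    var_x = sum_xx - sum_x ** 2 / n
    var_y = sum_yy - sum_y ** 2 / n
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return covariance / sqrt(var_x * var_y)

# Phrase counts for the same three phrases in two hypothetical texts
counts_a = [1, 2, 3]
counts_b = [2, 4, 6]
print("The Pearson score is", pearson(counts_a, counts_b))
```

Scores near 1 indicate that the two texts use the same phrases in similar proportions, scores near 0 indicate no linear relationship, and negative scores indicate that phrases common in one text tend to be rare in the other.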

In [None]:
import re
import string
from math import sqrt
from math import pow
treasure = {}
paradise = {}
filelist = ["./K11946_Files/treasure.txt", "./K11946_Files/paradise.txt"]
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
phraseform = re.compile(r'^[a-z]+ [a-z ]+$')
for filename in filelist:
    in_text = open(filename, "r", encoding="utf-8")
    in_text_string = in_text.read()
    in_text.close()
    in_text_string = in_text_string.replace("\n"," ")
    for stopword in stop_list:
        stopword = stopword.rstrip()
        in_text_string = re.sub(r' *\b' + stopword + r'\b *', '\n',in_text_string)
    in_text_string = re.sub(r'[\,\:\;\(\)]','\n',in_text_string)
    in_text_string = re.sub(r'[\.\!\?] +(?=[A-Z])', '\n', in_text_string)
    in_text_string = in_text_string.lower()
    item_list = re.split(r' *\n *', in_text_string)
    for phrase in item_list:
        phrase = re.sub(r' +',' ', phrase)
        phrase = phrase.strip()
        phrasematch = phraseform.match(phrase)
        if not (phrasematch):
            continue
        if (filename == "./K11946_Files/paradise.txt"):
            if phrase in paradise:
                paradise[phrase] = paradise[phrase] + 1
            else:
                paradise[phrase] = 1
            if not (phrase in treasure):
                treasure[phrase] = 0
        if (filename == "./K11946_Files/treasure.txt"):
            if phrase in treasure:
                treasure[phrase] = treasure[phrase] + 1
            else:
                treasure[phrase] = 1
            if not (phrase in paradise):
                paradise[phrase] = 0
# Running tallies for the Pearson correlation
count = 0
sumtally1 = sumtally2 = 0
sqtally1 = sqtally2 = 0
prodtally12 = 0
keylist = paradise.keys()
for key in keylist:
    count = count + 1
    sumtally1 = sumtally1 + paradise[key]
    sumtally2 = sumtally2 + treasure[key]
    sqtally1 = sqtally1 + pow(paradise[key],2)
    sqtally2 = sqtally2 + pow(treasure[key],2)
    prodtally12 = prodtally12 + (paradise[key] * treasure[key])
part1 = prodtally12 - (float(sumtally1 * sumtally2) / count)
part2 = sqtally1 - (float(pow(sumtally1,2)) / count)
part3 = sqtally2 - (float(pow(sumtally2,2)) / count)
similarity12 = float(part1) / float(sqrt(part2 * part3))
print("The Pearson score is", similarity12)

## Script Algorithm: Comparing Texts Using Similarity Scores

We could compare any two documents, but for this exercise we chose
Stevenson’s Treasure Island and Milton’s Paradise Lost. The two works represent
very different writing styles. The etext versions of these books are publicly available and can be downloaded from Project Gutenberg at the following
URLs:
<br>
<br>
For Paradise Lost:
<br>
https://www.gutenberg.org/ebooks/26
<br>
For Treasure Island:
<br>
http://www.gutenberg.org/etext/120

Put the names of each text file into an array. We will be performing the same
parsing steps on each of the two files.

In [None]:
import re
import string
from math import sqrt
from math import pow
treasure = {}
paradise = {}
filelist = ["./K11946_Files/treasure.txt", "./K11946_Files/paradise.txt"]

Open the STOP.TXT file, containing the high-frequency stopwords that we will
use to determine the boundaries of a phrase. (Remember: An index phrase is
a sequence of words bounded on both sides by a stop word or by the beginning
or the end of a sentence.) The stop file consists of one word per file line.
Put all of the words from the STOP.TXT file into an array, stripping the newline
character that separates each stop word from the subsequent stop word.

In [None]:
with open("./K11946_Files/STOP.TXT", 'r') as stopfile:
    # Strip the trailing newline from each stopword and skip blank lines
    stop_list = [line.rstrip() for line in stopfile if line.strip()]

Open each text file in turn and read the entire text into a string variable. Delete every newline character from the text file string, replacing it with a space character.

In the text file string, wherever a stopword occurs, replace the stopword (and any spaces around it) with a newline character. Repeat this replacement over the entire text file string for every stopword in our array of stop words. Each sequence of words that was bounded on either side by stopwords is now bounded by newline characters.

Wherever there is a “,”, “:”, “;”, “(” or “)” in the text file string, replace the punctuation with a newline character. We do this because these punctuation marks delineate the beginning and the end of an expression and, for the purposes of delineating index phrases, are equivalent to an end-of-sentence marker.

Wherever the text file string has a “.”, “!” or “?” followed by one or more spaces, followed by an uppercase letter, replace the punctuation and the following white spaces with a newline character. We do this because the pattern is typical of a sentence ending, and sentence endings mark the end of index phrases.

Convert the modified text file string, which now marks the beginning and ending of index phrases with newline characters, into lowercase.

Replace every sequence of multiple space characters in the text with a single space character.

Split the text file string into an array at every occurrence of a newline character bordered by zero or more spaces. This results in an array that includes all of the index phrases in the original text file.

Iterate through every phrase in the newly created array of index phrases. For each phrase, if the phrase does not match a sequence of lowercase letters followed by a space followed by a sequence of lowercase letters or spaces, skip to the next item in the phrase array. We do this primarily to eliminate single-word phrases, which contain no space between words. This step also eliminates phrases that contain numeric and nonalphabetic characters.

We will be using two dictionary objects. The first holds every index phrase from Paradise Lost as a key, with the number of occurrences of the phrase in Paradise Lost as the value; it also holds the index phrases that occur exclusively in Treasure Island, each with the value “0”. The second holds every index phrase from Treasure Island as a key, with the number of occurrences of the phrase in Treasure Island as the value; it also holds the index phrases that occur exclusively in Paradise Lost, each with the value “0”. The two dictionary objects therefore share the same set of keys: one holds the number of occurrences of each key in Paradise Lost, and the other holds the number of occurrences of each key in Treasure Island. We can then compare the dictionary objects key by key and value by value.

To create the two dictionary objects, increment the count of each phrase by one in the dictionary object for the text file in which it occurred, and create a key–value pair in the other text file’s dictionary object (if none exists) consisting of the phrase and the value “0”.

Repeat the parsing and counting steps for the second book. When you have done so, you will have collected the two dictionary objects that you will use to compute the Pearson score. At this point, you could substitute any other similarity score you prefer for the Pearson score.
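
The parsing steps described above can be tried on a short passage before running them on a whole book. This is a minimal sketch under stated assumptions: the function name `extract_phrases`, the sample passage, and the tiny stopword list are all made up for illustration, and the regular expressions mirror the ones described in the text.

```python
import re

def extract_phrases(text, stopwords):
    """Return the multi-word index phrases found in text (illustrative helper)."""
    text = text.replace("\n", " ")
    # Each stopword, with any surrounding spaces, becomes a phrase boundary.
    for stopword in stopwords:
        text = re.sub(r" *\b" + re.escape(stopword) + r"\b *", "\n", text)
    # Punctuation inside a sentence also ends a phrase.
    text = re.sub(r"[,:;()]", "\n", text)
    # Sentence-ending punctuation followed by a capital letter ends a phrase.
    text = re.sub(r"[.!?] +(?=[A-Z])", "\n", text)
    text = text.lower()
    phraseform = re.compile(r"^[a-z]+ [a-z ]+$")
    phrases = []
    for item in re.split(r" *\n *", text):
        item = re.sub(r" +", " ", item).strip()
        if phraseform.match(item):  # keeps only phrases of two or more words
            phrases.append(item)
    return phrases

# Hypothetical sample passage and stopword list.
sample = ("Jim opened the old sea chest, and the blind beggar "
          "ran to the inn door. Nobody followed.")
stopwords = ["the", "and", "to", "opened", "ran"]
phrases = extract_phrases(sample, stopwords)
print(phrases)  # ['old sea chest', 'blind beggar', 'inn door']
```

The single word "jim" is dropped by the phrase pattern, and "nobody followed." is dropped because its final period is not followed by a capital letter and so remains attached to the phrase.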

In [None]:
# An index phrase is at least two lowercase words separated by spaces.
phraseform = re.compile(r'^[a-z]+ [a-z ]+$')
for filename in filelist:
    with open(filename, "r", encoding="utf-8") as in_text:
        in_text_string = in_text.read()
    in_text_string = in_text_string.replace("\n", " ")
    # Replace every stopword, with any surrounding spaces, by a newline.
    for stopword in stop_list:
        stopword = stopword.strip()
        if not stopword:
            continue  # ignore blank lines in the stop file
        in_text_string = re.sub(r' *\b' + re.escape(stopword) + r'\b *', '\n', in_text_string)
    # Punctuation inside a sentence also ends a phrase.
    in_text_string = re.sub(r'[,:;()]', '\n', in_text_string)
    # So does sentence-ending punctuation followed by a capitalized word.
    in_text_string = re.sub(r'[.!?] +(?=[A-Z])', '\n', in_text_string)
    in_text_string = in_text_string.lower()
    item_list = re.split(r' *\n *', in_text_string)
    # Count into this book's dictionary; mirror new keys into the other one.
    if filename == "./K11946_Files/paradise.txt":
        own, other = paradise, treasure
    else:
        own, other = treasure, paradise
    for phrase in item_list:
        phrase = re.sub(r' +', ' ', phrase).strip()
        if not phraseform.match(phrase):
            continue
        own[phrase] = own.get(phrase, 0) + 1
        if phrase not in other:
            other[phrase] = 0
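
If the phrases from each book have already been collected into lists, the same matched-key dictionaries can be built more compactly. The sketch below is an alternative, not the book's method; it assumes the standard library `collections.Counter`, and the function name `aligned_counts` and the sample phrases are hypothetical.

```python
from collections import Counter

def aligned_counts(phrases_a, phrases_b):
    """Build two count dictionaries that share exactly the same keys."""
    counts_a = Counter(phrases_a)
    counts_b = Counter(phrases_b)
    all_keys = set(counts_a) | set(counts_b)
    # A Counter returns 0 for a missing key, which supplies the "0" entries.
    dict_a = {key: counts_a[key] for key in all_keys}
    dict_b = {key: counts_b[key] for key in all_keys}
    return dict_a, dict_b

# Made-up phrase lists standing in for the two books.
book_a, book_b = aligned_counts(
    ["sea chest", "sea chest", "old pirate"],
    ["heavenly host"],
)
print(sorted(book_a.items()))
print(sorted(book_b.items()))
```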

Iterate over every key in either dictionary object (we chose the dictionary
object for Paradise Lost; because both dictionary objects share the same set of
keys, either choice yields the same score). Keep a count of the total number of key–value pairs.
Produce a summation tally of the values in the Paradise Lost dictionary object
and another for the values in the Treasure Island dictionary object. Produce a summation tally of the squares of the values in the Paradise Lost dictionary
object and another for the squares of the values in the Treasure Island dictionary object.
Finally, produce a summation tally of the products of each value in the Paradise Lost
dictionary object multiplied by the corresponding value (the value of the same
key) in the Treasure Island dictionary object.

In [None]:
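# Running tallies for the Pearson calculation.
count = 0         # number of key-value pairs
sumtally1 = 0     # sum of the Paradise Lost values
sumtally2 = 0     # sum of the Treasure Island values
sqtally1 = 0      # sum of the squared Paradise Lost values
sqtally2 = 0      # sum of the squared Treasure Island values
prodtally12 = 0   # sum of the products of matching values
for key in paradise:
    count = count + 1
    sumtally1 = sumtally1 + paradise[key]
    sumtally2 = sumtally2 + treasure[key]
    sqtally1 = sqtally1 + pow(paradise[key], 2)
    sqtally2 = sqtally2 + pow(treasure[key], 2)
    prodtally12 = prodtally12 + (paradise[key] * treasure[key])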

After the dictionary object is parsed, take the tally variables that you
just computed and insert them into the Pearson formula.

The numerator of the Pearson score is the summation tally of the products, minus
the sum tally of the first dictionary object times the sum tally of the second
dictionary object divided by the number of keys. The denominator is the square
root of the product of two terms. The first term is the tally of the squares of
the values of the Paradise Lost dictionary object, minus the square of the sum
tally of the Paradise Lost dictionary object divided by the number of keys. The
second term is the tally of the squares of the values of the Treasure Island
dictionary object, minus the square of the sum tally of the Treasure Island
dictionary object divided by the number of keys.
This is an example where the description of a mathematical expression, in
English, is much, much more confusing than the program code for the mathematical
expression.*
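
In symbols, the same calculation can be written compactly. The notation here is introduced for illustration and does not appear in the original text: $x_i$ is the count of phrase $i$ in Paradise Lost, $y_i$ is the count of the same phrase in Treasure Island, and $n$ is the number of keys (`count` in the code).

$$
r = \frac{\sum_i x_i y_i - \dfrac{\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n}}
{\sqrt{\left(\sum_i x_i^2 - \dfrac{\left(\sum_i x_i\right)^2}{n}\right)\left(\sum_i y_i^2 - \dfrac{\left(\sum_i y_i\right)^2}{n}\right)}}
$$

The numerator corresponds to `part1` in the code, and the two factors under the square root correspond to `part2` and `part3`.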

In [None]:
part1 = prodtally12 - (float(sumtally1 * sumtally2) / count)
print(part1)
part2 = sqtally1 - (float(pow(sumtally1,2)) / count)
part3 = sqtally2 - (float(pow(sumtally2,2)) / count)
similarity12 = float(part1) / float(sqrt(part2 * part3))
print("The Pearson score is", similarity12)
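
The tally and formula cells can also be folded into one reusable function, which makes it easy to check the calculation on small inputs. This is a sketch rather than the book's code: the function name `pearson` and the toy dictionaries are hypothetical, and returning 0.0 when the denominator is zero is an assumption made here to avoid a division error.

```python
from math import sqrt

def pearson(counts_a, counts_b):
    """Pearson score for two count dictionaries that share the same keys."""
    n = len(counts_a)
    sum_a = sum(counts_a[key] for key in counts_a)
    sum_b = sum(counts_b[key] for key in counts_a)
    sq_a = sum(counts_a[key] ** 2 for key in counts_a)
    sq_b = sum(counts_b[key] ** 2 for key in counts_a)
    prod = sum(counts_a[key] * counts_b[key] for key in counts_a)
    numerator = prod - (sum_a * sum_b) / n
    spread = (sq_a - sum_a ** 2 / n) * (sq_b - sum_b ** 2 / n)
    if spread <= 0:
        return 0.0  # assumption: no variation in one input means no score
    return numerator / sqrt(spread)

# Toy dictionaries with matching keys (made-up counts).
book_a = {"sea chest": 3, "old pirate": 1, "heavenly host": 0}
book_b = {"sea chest": 0, "old pirate": 0, "heavenly host": 2}

same_score = pearson(book_a, book_a)    # a text compared against itself
cross_score = pearson(book_a, book_b)   # two texts with no shared phrases
print(same_score, cross_score)
```

Comparing a dictionary against itself gives a score of 1 (up to floating-point rounding), while the two toy dictionaries, which have no phrases in common, give a negative score.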

**This section is adapted from section 4.5.1, "Script Algorithm", of pages 69-70 from "Methods in Medical Informatics".*

## Analysis: Comparing Texts Using Similarity Scores

Pearson scores range from -1 to 1. A score of 1 occurs when a document is compared against itself. For Paradise Lost and Treasure Island, two highly dissimilar texts, the computed Pearson score is -0.38257. We expected, and received, a score toward the low end of the range.*

**This section is adapted from section 4.5.2, "Analysis", of page 76 from "Methods in Medical Informatics".*