Welcome to chapter four of Methods in Medical Informatics! Book indexing consists of collecting significant words and their associated page numbers. A similar process can be applied to online text to improve its organization and the speed of text processing. We will explore scripts that demonstrate computational text indexing. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics - Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# ZIPF Distribution of a Text File

In almost every segment of life, a small number of items usually account for the bulk of the observable activities. This pattern also holds true for the words that compose a text. The mathematical description of this phenomenon is known as Zipf's law. You can write a script to illustrate the Zipf distribution for any text.*

> This script uses the file [d2020.bin](https://datamine.unc.edu/data-files/). This file, which the script reads as UTF-8 text, contains tens of thousands of MeSH terms. Additional information is available [here](https://datamine.unc.edu/data-files/).

**Description adapted from pages 53-54 of "Methods in Medical Informatics"*

In [None]:
import re
import string
word_list = []
freq_list = []
format_list = []
freq = {}
in_text = open('d2020.bin', "r", encoding="utf-8")
in_text_string = in_text.read()
out_text = open("meshzipf.txt", "w")
word_list = re.findall(r'(\b[A-Za-z][a-z]{2,15}\b)', in_text_string)
in_text_string = ""
for item in word_list:
    count = freq.get(item,0)
    freq[item] = count + 1
for key, value in freq.items():
    value = "000000" + str(value)
    value = value[-6:]
    format_list += [value + " " + key]
format_list = sorted(format_list, reverse=True)
# Direct the output to meshzipf.txt through print's file argument
print("\n".join(format_list), file=out_text)
out_text.close()

## Script Algorithm: Zipf Distribution of a Text File

Import the necessary packages and initialize the variables*

In [None]:
import re
import string
word_list = []
freq_list = []
format_list = []
freq = {}

Open the input file for reading and create a new file, meshzipf.txt, which will receive the output of the Zipf distribution

In [None]:
in_text = open('d2020.bin', "r", encoding="utf-8")
in_text_string = in_text.read()
out_text = open("meshzipf.txt", "w")

Parse the string, matching each occurrence of a letter followed by at least 2, and at most 15, lowercase letters, with the sequence bounded on either side by a word boundary.

In [None]:
word_list = re.findall(r'(\b[A-Za-z][a-z]{2,15}\b)', in_text_string)
in_text_string = ""
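
To see what this pattern keeps and discards, here is a small illustration on a made-up sentence (the sample string is an assumption, not taken from the MeSH file). Words shorter than three letters and all-uppercase abbreviations are skipped.

```python
import re

# Same pattern as above: one letter of either case, then 2 to 15 lowercase letters
pattern = r'\b[A-Za-z][a-z]{2,15}\b'

# Hypothetical sample text, used only for illustration
sample = "Calcimycin is an ionophore; DNA of E. coli"
matches = re.findall(pattern, sample)
print(matches)
```

In this sample, "is", "an", and "of" are too short to match, and "DNA" fails because its second and third letters are uppercase.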

Create a dictionary object that will include words (keys) and number of occurrences (values)

In [None]:
for item in word_list:
    count = freq.get(item,0)
    freq[item] = count + 1

After the dictionary object is complete, format the values in the dictionary, as a zero-padded string of uniform length. 

In [None]:
for key, value in freq.items():
    value = "000000" + str(value)
    value = value[-6:]
    format_list += [value + " " + key]

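Sort the key-value pairs by value, in descending order, and write the sorted pairs to meshzipf.txt

In [None]:
format_list = sorted(format_list, reverse=True)
# Direct the output to meshzipf.txt through print's file argument
print("\n".join(format_list), file=out_text)
out_text.close()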

**This section is adapted from section 4.1.1, "Script Algorithm", of page 54 from "Methods in Medical Informatics".*
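
The zero-padding step exists so that a plain string sort orders the entries by count. As an aside, a minimal sketch of an alternative, assuming Python's standard `collections.Counter` is acceptable, sorts on the integer counts directly. The sample word list is hypothetical.

```python
from collections import Counter

# Hypothetical word list standing in for the output of re.findall
words = ["tumor", "and", "tumor", "duct", "and", "tumor"]

freq = Counter(words)

# most_common() returns (word, count) pairs ordered by descending count
lines = [f"{count:06d} {word}" for word, count in freq.most_common()]
print("\n".join(lines))
```

The padded format is kept here only so the output resembles the script's output; with `Counter` the padding is no longer needed for sorting.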

## Analysis: Zipf Distribution of a Text File

The top entries from the MeSH file are:

`036645 abcdef
034267 and
026575 abbcdef
017737 was
016454 see
014973 with
013647 under
010274 for
009718 that`

For these scripts, the entire content of a file is loaded into a string variable. This variable is subsequently parsed into words, with each occurrence of a word counted. If the file is very large, the script can be modified to read the file line by line, incrementing the word/frequency tally for the words contained in each line. At the top of the Zipf list are the high-frequency words, such as “the”, “and”, and “was”, that serve as connectors for lower-frequency, highly specific terms. Also included at the top of the Zipf list are frequently recurring letter sequences peculiar to the file; in this case, “abcdef” and “abbcdef”. Zipf distributions have many uses in informatics projects, including the preparation of “stopword” lists.*

**This section is adapted from section 4.1.2, "Analysis", of page 56 in "Methods in Medical Informatics".*
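
The line-by-line variant mentioned in the analysis might look like the following sketch. The function name `count_words_by_line` is hypothetical, and `io.StringIO` stands in for an open file so that the example runs without the MeSH file.

```python
import io
import re
from collections import Counter

def count_words_by_line(lines):
    """Tally words one line at a time, so the whole file never sits in memory."""
    freq = Counter()
    for line in lines:
        freq.update(re.findall(r'\b[A-Za-z][a-z]{2,15}\b', line))
    return freq

# io.StringIO behaves like an open text file; the content is made up
fake_file = io.StringIO("the tumor was small\nthe duct was clear\nthe end\n")
freq = count_words_by_line(fake_file)

# The most frequent words are natural candidates for a stopword list
print(freq.most_common(2))
```

Passing an open file object instead of `fake_file` would work the same way, because iterating over a text file yields one line at a time.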

# Preparing a Concordance

A concordance is a special type of index, listing every location of every word in the text. Concordances can be used to support very fast proximity searches (finding the locations of words in proximity to other words) and phrase searches (finding sequences of words located in an ordered sequence somewhere in the text). Using only a concordance, it is a simple matter to computationally recreate the entire text. Preparing a concordance is quite simple.*

> This script uses two text files, [STOP.TXT](https://datamine.unc.edu/data-files/) and [TITLES.TXT](https://datamine.unc.edu/data-files/). STOP.TXT contains a list of stopwords. TITLES.TXT contains a list of 100 titles of journal articles. More information is available [here](https://datamine.unc.edu/data-files/).

**Description adapted from page 57 of "Methods in Medical Informatics".*

In [None]:
import re
import string
sentence_list = []
word_list = []
word_dict = {}
format_list = []
count = 0
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open('./K11946_Files/TITLES.TXT', "r")
in_text_string = in_text.read()
in_text_string = in_text_string.replace("\n"," ")
in_text_string = re.sub(r' +', ' ', in_text_string)
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])',in_text_string)
for sentence in sentence_list:
    count = count + 1
    sentence = sentence.lower()
    word_list = re.findall(r'(\b[a-z]{3,15}\b)', sentence)
    for word in word_list:
        if word in word_dict:
            word_dict[word] = word_dict[word] + ',' + str(count)
        else:
            word_dict[word] = str(count)
keylist = word_dict.keys()
keylist = sorted(keylist)
for key in keylist:
    print(key, word_dict[key])

## Script Algorithm: Preparing a Concordance

Import the necessary packages*

In [None]:
import re
import string
sentence_list = []
word_list = []
word_dict = {}
format_list = []
count = 0

Read the stopword list, then read the entire contents of the TITLES.TXT file into a string variable

In [None]:
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open('./K11946_Files/TITLES.TXT', "r")
in_text_string = in_text.read()

Split the file into sentences

In [None]:
in_text_string = in_text_string.replace("\n"," ")
in_text_string = re.sub(r' +', ' ', in_text_string)
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])',in_text_string)
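
A short illustration of the splitting rule on a made-up string: a split happens only where sentence-ending punctuation and one or more spaces are followed by an uppercase letter. This sketch uses `re.sub` to collapse runs of spaces, since a pattern such as `' +'` is interpreted as a regular expression only by the `re` functions.

```python
import re

# Hypothetical text with a line break and repeated spaces
text = "Tumor of the bile  duct.\nA rare complication! Is it common? yes."

text = text.replace("\n", " ")
text = re.sub(r' +', ' ', text)  # collapse runs of spaces into one
sentences = re.split(r'[\.\!\?] +(?=[A-Z])', text)
print(sentences)
```

The question mark before "yes" does not trigger a split, because the following word starts with a lowercase letter.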

Parse each sentence into an array of words, and record the sentence number of each word in the dictionary object

In [None]:
for sentence in sentence_list:
    count = count + 1
    sentence = sentence.lower()
    word_list = re.findall(r'(\b[a-z]{3,15}\b)', sentence)
    for word in word_list:
        if word in word_dict:
            word_dict[word] = word_dict[word] + ',' + str(count)
        else:
            word_dict[word] = str(count)

Collect the keys of the dictionary object that contains the encountered words and their locations

In [None]:
keylist = word_dict.keys()

Order the words alphabetically and print out each word in the dictionary object

In [None]:
keylist = sorted(keylist)
for key in keylist:
    print(key, word_dict[key])

**This section is adapted from section 4.2.1, "Script Algorithm", of page 57 from "Methods in Medical Informatics".*
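
One way a concordance can support the searches described earlier is by intersecting the location lists of two words. The sketch below assumes a concordance in the same shape as `word_dict` (each word mapped to a comma-separated string of sentence numbers); the sample entries and the helper name `sentences_with_both` are hypothetical.

```python
# Hypothetical concordance in the same shape as word_dict
sample_concordance = {
    "bile": "1,4",
    "duct": "1,4,66",
    "rare": "1,39",
}

def sentences_with_both(concordance, first, second):
    """Return the sorted sentence numbers in which both words occur."""
    first_locations = set(concordance.get(first, "").split(","))
    second_locations = set(concordance.get(second, "").split(","))
    # Discard the empty string produced when a word is absent
    shared = (first_locations & second_locations) - {""}
    return sorted(int(number) for number in shared)

print(sentences_with_both(sample_concordance, "bile", "duct"))
```

Because the lookup touches only two dictionary entries, the original text never has to be rescanned.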

## Analysis: Preparing a Concordance

The sample text consisted of 100 parsed sentences. Here are the first few lines of the output.*

`carcinoid 1
tumor 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
the 1,6,7,8,11,12,14,16,17,19,22,23,35,39,41,41,44,47,49,51,55,58,59,65,70,71,72,76,78,78,90,95,96,98
common 1
bile 1
duct 1,66
rare 1,39
complication 1`

**This section is adapted from section 4.2.2, "Analysis", of pages 59-60 from "Methods in Medical Informatics".*

# Extracting Phrases

All text is composed of words and phrases that represent specific concepts and that are connected together into a sequence of meaningful statements. One way to extract useful concepts is to remove common words or "stopwords". This script will demonstrate phrase extraction through stopword removal.*

> This script uses the text files [STOP.TXT](https://datamine.unc.edu/data-files/) and [cancer_gene_titles.txt](https://datamine.unc.edu/data-files/). STOP.TXT contains a list of common stopwords. cancer_gene_titles.txt contains a list of titles of cancer-related journal articles extracted from a PubMed query. More information is available [here](https://datamine.unc.edu/data-files/).

**Description adapted from page 60 of "Methods in Medical Informatics".*

In [None]:
import re, string
item_list = []
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open("./K11946_Files/cancer_gene_titles.txt", "r")
count = 0
for line in in_text:
    count = count + 1
    for stopword in stop_list:
        stopword = re.sub(r'\n', '', stopword)
        line = re.sub(r' *\b' + stopword + r'\b *', '\n', line)
    item_list.extend(line.split("\n"))
item_list = sorted(set(item_list))
out_text = open('phrases.txt', "w")
for item in item_list:
    print(item, file=out_text)
out_text.close()

## Script Algorithm: Extracting Phrases

Import the necessary packages*

In [None]:
import re, string
item_list = []

Open the stop.txt file, containing a list of common stopwords. Split into a list structure

In [None]:
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()

Open cancer_gene_titles.txt

In [None]:
in_text = open("./K11946_Files/cancer_gene_titles.txt", "r")
count = 0

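Parse through the lines of the text. Substitute a newline character for every occurrence of any stopword in the line, and split the line into phrases at the newline characters. Then remove duplicate phrases, sort the phrases alphabetically, and open the output file, phrases.txt.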

In [None]:
for line in in_text:
    count = count + 1
    for stopword in stop_list:
        stopword = re.sub(r'\n', '', stopword)
        line = re.sub(r' *\b' + stopword + r'\b *', '\n', line)
    item_list.extend(line.split("\n"))
item_list = sorted(set(item_list))
out_text = open('phrases.txt', "w")

Write each phrase from the alphabetically sorted list to phrases.txt

In [None]:
for item in item_list:
    print(item, file=out_text)
out_text.close()

**This section is adapted from section 4.3.1, "Script Algorithm", of page 61 from "Methods in Medical Informatics".*
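
The stopword substitution can be tried on a single made-up title. In this sketch the helper name `extract_phrases`, the four-word stopword list, and the sample title are all hypothetical; `re.escape` is an added precaution (not part of the script above) in case a stopword contains a character with special meaning in a regular expression.

```python
import re

def extract_phrases(line, stopwords):
    """Split a line into phrases wherever a stopword occurs."""
    for stopword in stopwords:
        line = re.sub(r' *\b' + re.escape(stopword) + r'\b *', '\n', line)
    # Drop the empty fragments left behind by adjacent stopwords
    return [phrase.strip() for phrase in line.split('\n') if phrase.strip()]

sample_stopwords = ["of", "the", "in", "and"]
sample_title = "expression of the p53 gene in breast cancer and lung cancer"
phrases = extract_phrases(sample_title, sample_stopwords)
print(phrases)
```

Adjacent stopwords such as "of the" produce two consecutive newline characters, which is why the empty fragments are filtered out.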

## Analysis: Extracting Phrases

The output is an alphabetic file of the phrases that might appear in a book's index. We used the file consisting of titles from a PubMed search. This file, cancer_gene_titles.txt, is about 1.1 MB in length, the size of a typical book. We only required about a dozen lines of code and a few seconds of execution time to create our list of index terms.*

**This section is adapted from section 4.3.2, "Analysis", of page 63 from "Methods in Medical Informatics".*

# Preparing an Index

An index is a list of the important words or phrases contained in a book, along with the locations where each of those words and phrases can be found. This is different from a concordance because the index does not contain every word found in the text, and the index contains groups of selected phrases in addition to individual words. Software can be used to create indexes. However, remember that a useful index is more selective than simply recording the location of every word and phrase.*

> This script uses the text files [STOP.TXT](https://datamine.unc.edu/data-files/) and [TEXT.txt](https://datamine.unc.edu/data-files/). STOP.TXT contains a list of common stopwords. TEXT.txt contains a sample journal article. More information is available [here](https://datamine.unc.edu/data-files/).


**Description adapted from pages 63-64 of "Methods in Medical Informatics".*

In [None]:
import re
import string
item_list = []
item_dictionary = {}
place_string = ""
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open('./K11946_Files/TEXT.TXT', 'r')
in_text_string = in_text.read()
in_text_string = in_text_string.replace("\n"," ")
in_text_string = re.sub(r' +', ' ', in_text_string)
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])',in_text_string)
# Map every character in the text that is not printable ASCII to a space
table = {ord(ch): " " for ch in set(in_text_string) if ch not in string.printable}
count = 0
for item in sentence_list:
    count = count + 1
    count_string = str(count)
    item = item.lower()
    item = re.sub(r'\'s', "", item)
    item = item.translate(table)
    for stopword in stop_list:
        stopword = stopword.rstrip()
        item = re.sub(r' *\b' + stopword + r'\b *', '\n', item)
    item_list = item.split("\n")
    for phrase in item_list:
        phrasematch = re.match(r'^[0-9]', phrase)
        if (phrasematch):
            continue
        if phrase in item_dictionary:
            item_dictionary[phrase] = item_dictionary[phrase] + ',' + count_string
        else:
            item_dictionary[phrase] = count_string
keylist = item_dictionary.keys()
keylist = sorted(keylist)
for key in keylist:
    print(key, item_dictionary[key])

## Script Algorithm: Preparing an Index

Create an array containing stopwords. You can use any stopword list you prefer.
In this script, we use stop.txt available at http://datamine.unc.edu/jupyter/edit/Methods-in-Medical-Informatics-master/K11946_Files/STOP.TXT

In [None]:
import re
import string
item_list = []
item_dictionary = {}
place_string = ""
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()

Open a file to be indexed. You can use any file, but in this text, we use text.txt, available at http://www.julesberman.info/book/text.txt

In [None]:
in_text = open('./K11946_Files/TEXT.TXT', 'r')
in_text_string = in_text.read()

Strip the text of any non-ASCII characters (not necessary if you are using a
plain-text file). The translation table built here is applied to each sentence in the indexing loop.

In [None]:
# Map every character in the text that is not printable ASCII to a space
table = {ord(ch): " " for ch in set(in_text_string) if ch not in string.printable}

Split the text into sentences and put the consecutive sentences into an array.

In [None]:
in_text_string = in_text_string.replace("\n"," ")
in_text_string = re.sub(r' +', ' ', in_text_string)
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])',in_text_string)

Create a dictionary object, which will hold phrases as keys and a comma-separated
list of numbers, representing the sentences in which the phrases
appear, as the values. For each sentence in the array of consecutive sentences, split the sentence
wherever a stopword appears, and put the resulting phrases into an array. Then parse through each array of
phrases, assigning each phrase to a dictionary key, and concatenating the sentence
number in which the phrase occurs to the comma-separated list of sentence
numbers that serves as the value for the key (phrase). Finally, sort the phrases alphabetically and print each phrase with its sentence numbers.*

In [None]:
count = 0
for item in sentence_list:
    count = count + 1
    count_string = str(count)
    item = item.lower()
    item = re.sub(r'\'s', "", item)
    item = item.translate(table)
    for stopword in stop_list:
        stopword = stopword.rstrip()
        item = re.sub(r' *\b' + stopword + r'\b *', '\n', item)
    item_list = item.split("\n")
    for phrase in item_list:
        phrasematch = re.match(r'^[0-9]', phrase)
        if (phrasematch):
            continue
        if phrase in item_dictionary:
            item_dictionary[phrase] = item_dictionary[phrase] + ',' + count_string
        else:
            item_dictionary[phrase] = count_string
keylist = item_dictionary.keys()
keylist = sorted(keylist)
for key in keylist:
    print(key, item_dictionary[key])

**This section is adapted from section 4.4.1, "Script Algorithm", of page 65 from "Methods in Medical Informatics".*
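
The indexing loop can be condensed into a small function for experimentation. This is a sketch under stated assumptions rather than the book's script: the function name `build_index`, the sample sentences, and the short stopword list are hypothetical, sentence numbers are stored in lists instead of comma-separated strings, and empty fragments are skipped.

```python
import re
from collections import defaultdict

def build_index(sentences, stopwords):
    """Map each phrase to the list of sentence numbers in which it appears."""
    index = defaultdict(list)
    for number, sentence in enumerate(sentences, start=1):
        sentence = sentence.lower()
        for stopword in stopwords:
            sentence = re.sub(r' *\b' + re.escape(stopword) + r'\b *', '\n', sentence)
        for phrase in sentence.split('\n'):
            phrase = phrase.strip()
            # Skip empty fragments and phrases that begin with a digit
            if not phrase or phrase[0].isdigit():
                continue
            index[phrase].append(number)
    return dict(index)

sample_sentences = [
    "Adjuvant chemotherapy was given",
    "The adjuvant chemotherapy and 3 cycles of imrt",
]
sample_stopwords = ["the", "was", "and", "of"]
index = build_index(sample_sentences, sample_stopwords)
for phrase in sorted(index):
    print(phrase, ",".join(str(n) for n in index[phrase]))
```

Storing the locations as lists defers the comma-joining to print time, which keeps the numbers available for later numeric comparison.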

## Analysis: Preparing an Index

An example of the kind of output produced by the script is shown below.

`adjustment 7,9
adjuvant chemotherapy 83
adjuvant imrt 23
`

The numbers represent the sentence numbers in which each phrase occurs. Automated indexing invariably produces a product that a human indexer can improve. The strength of automatic indexing is found when the texts are very long. Humans cannot index long texts. A flawed computer-generated index is usually better than no index at all.*

**This section is adapted from section 4.4.2, "Analysis", of page 68 from "Methods in Medical Informatics".*

# Comparing Texts Using Similarity Scores

When you have extracted all of the phrases occurring in a text, you have created something akin to the signature of the text. We can then determine whether two different texts are similar by comparing their signatures. Similarity scores are very useful in medical science. We can use similarity scores to establish relatedness of objects (e.g., DNA sequences), to find trends and outliers in population data, to provide "best-fit" search results, and to classify groups of items. This script will demonstrate calculating the similarity between two documents using Pearson correlation.*

> This script uses the text files [STOP.TXT](https://datamine.unc.edu/data-files/), [paradise.txt](https://datamine.unc.edu/data-files/), and [treasure.txt](https://datamine.unc.edu/data-files/). STOP.TXT contains a list of common stopwords. paradise.txt contains the epic poem **Paradise Lost** in text format. treasure.txt contains the novel **Treasure Island** in text format. More information is available [here](https://datamine.unc.edu/data-files/).


**This section is adapted from page 69 of "Methods in Medical Informatics".*

In [None]:
import re
import string
from math import sqrt
from math import pow
treasure = {}
paradise = {}
filelist = ["./K11946_Files/treasure.txt", "./K11946_Files/paradise.txt"]
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
phraseform = re.compile(r'^[a-z]+ [a-z ]+$')
for filename in filelist:
    in_text = open(filename, "r", encoding="utf-8")
    in_text_string = in_text.read()
    in_text.close()
    in_text_string = in_text_string.replace("\n"," ")
    for stopword in stop_list:
        stopword = stopword.rstrip()
        in_text_string = re.sub(r' *\b' + stopword + r'\b *', '\n',in_text_string)
    in_text_string = re.sub(r'[\,\:\;\(\)]','\n',in_text_string)
    in_text_string = re.sub(r'[\.\!\?] +(?=[A-Z])', '\n', in_text_string)
    in_text_string = in_text_string.lower()
    item_list = re.split(r' *\n *', in_text_string)
    for phrase in item_list:
        phrase = re.sub(r' +',' ', phrase)
        phrase = phrase.strip()
        phrasematch = phraseform.match(phrase)
        if not (phrasematch):
            continue
        if (filename == "./K11946_Files/paradise.txt"):
            if phrase in paradise:
                paradise[phrase] = paradise[phrase] + 1
            else:
                paradise[phrase] = 1
            if not (phrase in treasure):
                treasure[phrase] = 0
        if (filename == "./K11946_Files/treasure.txt"):
            if phrase in treasure:
                treasure[phrase] = treasure[phrase] + 1
            else:
                treasure[phrase] = 1
            if not (phrase in paradise):
                paradise[phrase] = 0
count = 0; sumtally1 = 0; sumtally2 = 0; sqtally1 = 0; sqtally2 = 0
prodtally12 = 0; part1 = 0; part2 = 0; part3 = 0;
keylist = paradise.keys()
for key in keylist:
    count = count + 1;
    sumtally1 = sumtally1 + paradise[key]
    sumtally2 = sumtally2 + treasure[key]
    sqtally1 = sqtally1 + pow(paradise[key],2)
    sqtally2 = sqtally2 + pow(treasure[key],2)
    prodtally12 = prodtally12 + (paradise[key] * treasure[key])
part1 = prodtally12 - (float(sumtally1 * sumtally2) / count)
part2 = sqtally1 - (float(pow(sumtally1,2)) / count)
part3 = sqtally2 - (float(pow(sumtally2,2)) / count)
similarity12 = float(part1) / float(sqrt(part2 * part3))
print("The Pearson score is", similarity12)

## Script Algorithm: Comparing Texts Using Similarity Scores

We could compare any two documents, but for this exercise we chose
Stevenson’s Treasure Island and Milton’s Paradise Lost. The two works represent
very different writing styles. The etext versions of these books are publicly available and can be downloaded from Project Gutenberg at the following
URLs:
<br>
<br>
For Paradise Lost:
<br>
http://www.gutenberg.org/dirs/etext91/plboss10.txt
<br>
For Treasure Island:
<br>
http://www.gutenberg.org/etext/120

Put the names of each text file into an array. We will be performing the same
parsing steps on each of the two files.

In [None]:
import re
import string
from math import sqrt
from math import pow
treasure = {}
paradise = {}
filelist = ["./K11946_Files/treasure.txt", "./K11946_Files/paradise.txt"]

Open the stop.txt file, containing the high-frequency stopwords that we will
use to determine the boundaries of a phrase. (Remember: An index phrase is
a sequence of words bounded on both sides by a stop word or by the beginning
or the end of a sentence.) The stop file consists of one word per file line.
Put all of the words from the stop.txt file into an array, stripping the newline
character that separates each stop word from the subsequent stop word.

In [None]:
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()

Open the first text file in the list (Treasure Island, in the code below), and read the entire text into a
string variable. Replace every newline character in the text file string with a
space character.

In the text file string, wherever there is a sequence of words bounded on either
side by a stopword, replace the stopwords with a newline character. Iterate
this determination and replacement over the entire text file string, for every
stopword in our array of stopwords.

Wherever there is a “,”, “:”, “;”, “(” or “)” in the text file string, replace the punctuation
with a newline character. We do this because these punctuation marks
delineate the beginning and the end of an expression and, for the purposes of
delineating index phrases, these punctuation marks are equivalent to an end-of-sentence marker.

Wherever the text file string has a “.”, “!”, or “?” followed by one or more spaces,
followed by an uppercase letter, replace the punctuation and the following white
spaces with a newline character. We do this because the pattern is typical of a
sentence ending, and sentence endings mark the end of index phrases.

Convert the modified text file string, which now marks the beginning and
ending of index phrases with newline characters, into lowercase.

Split the text file string into an array, at every occurrence of a newline character
bordered by zero or more spaces. This results in an array that includes all
of the index phrases in the original text file.

Iterate through every phrase in the newly created array of index phrases,
replacing all sequences consisting of multiple space characters with a single space character.
For each phrase, if the phrase does not match a sequence of lowercase letters
followed by a space followed by a sequence of lowercase letters or spaces, skip
to the next item in the phrase array. We do this primarily to eliminate single
word phrases that do not contain a space intervening between words. This step
also eliminates phrases that contain numeric and nonalphabet characters.

We will be using two dictionary objects. The first consists of all
of the index phrases from Paradise Lost as keys, and the number of occurrences
of each index phrase in Paradise Lost as the values, as well as the index phrases
that occur exclusively in Treasure Island, all with the number “0” as the value.
The other dictionary object will consist of the index phrases from Treasure
Island as keys, and the number of occurrences of each index phrase from
Treasure Island as the values, as well as the index phrases that occur exclusively
in Paradise Lost, all with the number “0” as the value. In this way, we create
two dictionary objects that have the same matching
set of keys, with one holding the number of occurrences of the
keys in Paradise Lost, and the other holding the number of occurrences of the
keys in Treasure Island. We can then compare the dictionary objects key by
key and value by value.

To create the two dictionary objects, increment each occurrence of a phrase
by one in the dictionary object for the text file in which it has occurred, and
create a key–value pair in the other text file’s dictionary object (if none exists)
consisting of the phrase and the value “0”.

Repeat these steps for the second book, Paradise Lost. When you have
repeated these steps for the second book you will have collected the two
dictionary objects that you will use to compute the Pearson score. At this
point, you could substitute any similarity correlation score you prefer over the
Pearson score.

In [None]:
phraseform = re.compile(r'^[a-z]+ [a-z ]+$')
for filename in filelist:
    in_text = open(filename, "r", encoding="utf-8")
    in_text_string = in_text.read()
    in_text.close()
    in_text_string = in_text_string.replace("\n"," ")
    for stopword in stop_list:
        stopword = stopword.rstrip()
        in_text_string = re.sub(r' *\b' + stopword + r'\b *', '\n',in_text_string)
    in_text_string = re.sub(r'[\,\:\;\(\)]','\n',in_text_string)
    in_text_string = re.sub(r'[\.\!\?] +(?=[A-Z])', '\n', in_text_string)
    in_text_string = in_text_string.lower()
    item_list = re.split(r' *\n *', in_text_string)
    for phrase in item_list:
        phrase = re.sub(r' +',' ', phrase)
        phrase = phrase.strip()
        phrasematch = phraseform.match(phrase)
        if not (phrasematch):
            continue
        if (filename == "./K11946_Files/paradise.txt"):
            if phrase in paradise:
                paradise[phrase] = paradise[phrase] + 1
            else:
                paradise[phrase] = 1
            if not (phrase in treasure):
                treasure[phrase] = 0
        if (filename == "./K11946_Files/treasure.txt"):
            if phrase in treasure:
                treasure[phrase] = treasure[phrase] + 1
            else:
                treasure[phrase] = 1
            if not (phrase in paradise):
                paradise[phrase] = 0
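
The same pair of key-aligned dictionaries can be produced more compactly. This sketch assumes the phrases of each book have already been extracted into lists (the two short lists here are made up), and it relies on `collections.Counter`, which returns 0 for a missing key.

```python
from collections import Counter

# Hypothetical phrase lists standing in for the phrases extracted from each book
paradise_phrases = ["heavenly muse", "eternal providence", "heavenly muse"]
treasure_phrases = ["treasure island", "old sea dog"]

paradise_counts = Counter(paradise_phrases)
treasure_counts = Counter(treasure_phrases)

# Give both dictionaries the same key set, filling absent phrases with 0
all_phrases = set(paradise_counts) | set(treasure_counts)
paradise = {phrase: paradise_counts[phrase] for phrase in all_phrases}
treasure = {phrase: treasure_counts[phrase] for phrase in all_phrases}

print(sorted(paradise.items()))
```

Building the union of keys once avoids checking, for every phrase, whether the other dictionary already has an entry.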

Parse over every key–value pair in either dictionary object (we chose the dictionary
object for Paradise Lost, but the calculation, which depends on differences
between the two dictionary objects, would yield the same score using
either dictionary object). Keep a count of the total number of key–value pairs. Produce a summation tally of the values in the Paradise Lost dictionary object
and in the Treasure Island dictionary object. Produce a summation tally of the squares of the values in the Paradise Lost dictionary
object and the squares of the values in the Treasure Island dictionary object. Produce a summation tally of the products of each value in the Paradise Lost
dictionary object multiplied by the corresponding value (the value of the same
key) in the Treasure Island dictionary object.

In [None]:
count = 0; sumtally1 = 0; sumtally2 = 0; sqtally1 = 0; sqtally2 = 0
prodtally12 = 0; part1 = 0; part2 = 0; part3 = 0;
keylist = paradise.keys()
for key in keylist:
    count = count + 1;
    sumtally1 = sumtally1 + paradise[key]
    sumtally2 = sumtally2 + treasure[key]
    sqtally1 = sqtally1 + pow(paradise[key],2)
    sqtally2 = sqtally2 + pow(treasure[key],2)
    prodtally12 = prodtally12 + (paradise[key] * treasure[key])

After the dictionary object is parsed, you will take the tally variables that you
just computed, and you will insert them into the Pearson formula.

The numerator of the Pearson score is the summation tally of the products, minus the sum tally
of the first dictionary object times the sum tally of the second dictionary object
divided by the number of keys in the object. The denominator is the square root of
the product of two terms. The first term is the tally of the squares of the values of the Paradise Lost dictionary object
minus the square of the sum tally of the Paradise Lost dictionary object divided
by the number of keys in the object. The second term is the tally of the squares of the values
of the Treasure Island dictionary object minus the square of the sum tally of the
Treasure Island dictionary object divided by the number of keys in the object.

This step is an example where the description of a mathematical expression, in
English, is much, much more confusing than the program code for the mathematical
expression.*

In [None]:
part1 = prodtally12 - (float(sumtally1 * sumtally2) / count)
part2 = sqtally1 - (float(pow(sumtally1,2)) / count)
part3 = sqtally2 - (float(pow(sumtally2,2)) / count)
similarity12 = float(part1) / float(sqrt(part2 * part3))
print("The Pearson score is", similarity12)
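
Written as a formula, the calculation in the cell above reads as follows. The notation is introduced here only for illustration and does not appear in the code: $x_i$ and $y_i$ denote the counts of phrase $i$ in the Paradise Lost and Treasure Island dictionary objects, and $n$ is the number of keys.

$$
r = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}{\sqrt{\left(\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}\right)\left(\sum y_i^2 - \dfrac{\left(\sum y_i\right)^2}{n}\right)}}
$$

The numerator corresponds to `part1`, and the two factors under the square root correspond to `part2` and `part3`.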

**This section is adapted from section 4.5.1, "Script Algorithm", of pages 69-70 from "Methods in Medical Informatics".*

## Analysis: Comparing Texts Using Similarity Scores

Pearson scores range from -1 to 1. A score of 1 occurs when a document is compared against itself. When we compute the Pearson score between two highly dissimilar texts, the yielded score is -0.38257. We expected and received a low-end Pearson score.*

**This section is adapted from section 4.5.2, "Analysis", of page 76 from "Methods in Medical Informatics".*
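
To check the claim that a text compared against itself scores 1, the formula can be wrapped in a small function. The function name `pearson` and the sample counts are hypothetical; `math.isclose` is used because floating-point results are not guaranteed to be exactly 1.

```python
from math import isclose, sqrt

def pearson(x_counts, y_counts):
    """Pearson score of two equal-length lists of counts."""
    n = len(x_counts)
    sum_x = sum(x_counts)
    sum_y = sum(y_counts)
    sum_xx = sum(x * x for x in x_counts)
    sum_yy = sum(y * y for y in y_counts)
    sum_xy = sum(x * y for x, y in zip(x_counts, y_counts))
    numerator = sum_xy - (sum_x * sum_y) / n
    denominator = sqrt((sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n))
    return numerator / denominator

# Made-up phrase counts: each phrase occurs in only one of the two texts
counts_a = [3, 1, 0, 0]
counts_b = [0, 0, 2, 4]

print(isclose(pearson(counts_a, counts_a), 1.0))  # self-comparison
print(pearson(counts_a, counts_b) < 0)            # no shared phrases
```

When two texts share almost no phrases, a high count in one dictionary lines up with a zero in the other, which pushes the score below zero. Note that this sketch does not guard against a zero denominator, which occurs when all counts in one list are equal.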