# Tutorial 4: Convert text into sparse form
The aim of this tutorial is to learn how to convert unstructured text into a sparse, structured form. The text data that you are going to process consists of 16 clinical reports, each of which is stored in the following format:
 <img src="pic_1.png" height="500" width="700">
where sentences are separated by an empty line, and a sequence of 10 asterisks ("*") is used to indicate segment boundaries. Your task is to convert the text files into a proper format so that text segmentation algorithms can take the preprocessed text as input.

The preprocessing task is to generate the following files:
* <b>clinical_voc.txt</b> contains the vocabulary generated from the 16 clinical reports. It should look like 
<img src="pic_2.png" height="100" width="300">
* <b>clinical.txt</b> contains the preprocessed clinical reports. Each line contains a file name, a sentence index, a word index and its TF-IDF score.
The file you are going to generate should look like 
<img src="pic_3.png" height="100" width="300">
The word index is the index of the word in the vocabulary, and the real value is the TF-IDF score of the word in the sentence.
* <b>clinical_boundaries.txt</b> contains the segmentation of each clinical report. 
<img src="pic_4.png" height="300" width="500">
where each line starts with the name of the clinical report, followed by a binary vector. "1" indicates there is a boundary after the current sentence, "0" otherwise.

The sample output files are provided, which are
* <b>clinical_voc_sample.txt</b>
* <b>clinical_sample.txt</b>
* <b>clinical_boundaries_sample.txt</b>

Your task is to write your own Python code to produce exactly the same files.

In order to generate the three files, you need to think about how to combine some of the techniques covered in week 3. For example, we need
* Word tokenization
* Case Normalization
* Stopword removal

and other techniques from lecture materials in week 4. Please finish the following tasks in this tutorial.

Note that, for demonstration purposes, we have split the code into different chunks in different cells.

In [1]:
import os
import nltk

In [2]:
text_file_folder = "./data"

## Task 1: Read the data files 

The first task is to read the 16 clinical reports stored in the folder <font color="red">"./data"</font>. The file extension used is ".ref". The actual task contains:
for each file,
1. make sure the file's extension is ".ref" 
2. read all the sentences, and store them in a list. The sentence order matters. The blank lines between sentences should be removed.
3. read the segmentation boundaries. You should record where the boundaries are in terms of sentence indexes. For example, if there is a boundary after the 10th, 20th, 30th and 40th sentences, you should generate a list that looks like [9, 19, 29, 39].
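The steps above can be sketched on a tiny in-memory report before touching the real files. The helper name `parse_report` and the toy lines are hypothetical, introduced only for illustration; the actual reports are read from `./data`.

```python
def parse_report(lines, marker="**********"):
    """Split raw lines into sentences and 0-based boundary indexes."""
    sents, bds = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines only separate sentences
        if line == marker:
            # the boundary falls after the most recently stored sentence
            bds.append(len(sents) - 1)
        else:
            sents.append(line)
    return sents, bds

# toy report (made up): a boundary after the 2nd and after the 3rd sentence
toy_lines = ["First sentence", "", "Second sentence", "", "**********", "",
             "Third sentence", "", "**********"]
sents, bds = parse_report(toy_lines)
print(sents)  # ['First sentence', 'Second sentence', 'Third sentence']
print(bds)    # [1, 2]
```

Because the boundary index is taken from the number of sentences stored so far, blank lines and marker lines never shift the sentence indexes.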

In terms of data structure, you can use two dictionaries to store sentences and boundaries separately. For both dictionaries, the key must be the same, e.g., the name of each clinical report so that we can easily match the pre-processed report with the corresponding boundaries. 

You can also use one dictionary, where the keys are the report file names and each value is a pair consisting of a sentence list and a boundary list. The following code implements the idea of using one dictionary. Please fill in the missing code.

In [3]:
sents_dict = {}
for root, subFolders, files in os.walk(text_file_folder):
    for file_name in files: # for each file in ./data
        file_path = os.path.join(root, file_name)
        # make sure the file extension is ".ref". We try to exclude files such as
        # ".DS_Store" in MacOS.
        if file_path.endswith('.ref'):
            i = 0 # count the total number of sentences in each file
            sents = []
            bds = []
            # the with statement closes the file once all lines are read
            with open(file_path) as file_reader:
                for line in file_reader:
                #########please write the missing code below#######
                    line = line.strip()
                    if line != '': # exclude the empty lines
                        if line != "**********":
                            sents.append(line)
                            i = i + 1
                        # Lines that only contain "**********" indicate segment boundaries.
                        # For the text segmentation task, we need to record where the
                        # boundaries are located in the text
                        else:
                            bds.append(i-1)
                ######################################################
            sents_dict[file_name] = (sents, bds)

It is always a good idea to check the output of your code, i.e., to run a sanity check. For example, print the list of sentences of "000.ref" and the list of boundaries generated above, and manually check them against the original text file "000.ref".

In [4]:
sents_dict['000.ref'][0]

['Physical diagnosis had its origins in Grecian medicine',
 'Clinical medicine flourished before the Greeks  especially in Egypt  Crete  and Babylonia  and undoubtedly the Greeks were influenced by these earlier physicians',
 'But writings from these countries did not become part of the mainstream of Western civilization  as did those of the Greeks',
 'Table contains two quotations that illustrate the level of medicine practiced by the Greeks',
 'They took a careful history and practiced direct auscultation',
 'They were masters of observation  their descriptions of patients could fit modern texts without much change',
 'Greek medicine flourished early',
 'Homer in the Iliad ca  b c',
 'described  wounds and used  anatomic terms',
 'Hippocrates ca  b c',
 'lived during the Golden Age of Greece',
 'His contemporaries included Plato  Socrates  Aeschylus  Sophocles  Euripides  Aristophanes  and Pericles',
 'Medicine became in his hands an art  a science  a profession Major',
 'The Hippocr

Then, have a look at the boundaries.

In [5]:
sents_dict['000.ref'][1]

[36, 90, 149, 213, 286, 355, 420, 463, 523, 559, 593, 623, 719, 798, 801]
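A list of boundary indexes like this one can be turned into the binary vector described for clinical_boundaries.txt. The sketch below is one way to do it; the helper name `boundary_vector` and the toy numbers are hypothetical, and the exact line layout of the output file is not assumed here, so compare against clinical_boundaries_sample.txt before choosing a separator.

```python
def boundary_vector(num_sents, bds):
    """Return one 0/1 flag per sentence; 1 means a boundary follows that sentence."""
    bd_set = set(bds)  # set lookup keeps the loop linear in the number of sentences
    return [1 if i in bd_set else 0 for i in range(num_sents)]

# toy report (made up): 6 sentences, boundaries after sentences 1 and 5 (0-based)
vec = boundary_vector(6, [1, 5])
print(vec)  # [0, 1, 0, 0, 0, 1]
```

For a real report, the two arguments would come from the stored pair, e.g. the length of the sentence list and the boundary list of one entry in `sents_dict`.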

## Task 2: Tokenize each sentence in each document

In this task, you are going to tokenize all the sentences for each clinical report. The tokenization function is provided as follows:

In [6]:
def tokenize_sent(sent):
    """
    The function tokenizes a sentence, and returns a list of words that contain only
    alphabetic letters.
    """
    return [word for word in nltk.word_tokenize(sent.lower()) if word.isalpha()]

The above function uses NLTK's built-in tokenizer to tokenize a given sentence. All the words are converted to lower case, and only words that consist entirely of alphabetic letters are kept. Note that this case normalization and filtering are simplifications. For some tasks, like sentence segmentation and named entity recognition, it is better to keep the original form of the words.
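To see what the `isalpha` filter discards, the sketch below applies the same lower-casing and filtering to a hand-written token list. The token list is an assumption standing in for tokenizer output, so that the example runs without NLTK's tokenizer models.

```python
# hypothetical tokens, standing in for the output of a word tokenizer
tokens = ["The", "patient", "'s", "T-cell", "count", "was", "120", "."]

kept = [tok.lower() for tok in tokens if tok.isalpha()]
dropped = [tok for tok in tokens if not tok.isalpha()]

print(kept)     # ['the', 'patient', 'count', 'was']
print(dropped)  # ["'s", 'T-cell', '120', '.']
```

Hyphenated terms and numbers are removed entirely rather than split, which is worth keeping in mind for clinical text.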

Now, you should write your code to tokenize all the sentences stored in <font color="orange">sents_dict</font>

In [7]:
tokenized_sents = {} #The key is the document name, the value is a list of tokenized sentences
#########please write the missing code below#######
for key, value in sents_dict.items():
    tokenized_sents[key] = [tokenize_sent(sent) for sent in value[0]]

Similarly, you should check the output of your code. Here, we print the first three tokenized sentences of "000.ref", and compare them with the original sentences in "000.ref".

In [8]:
print(tokenized_sents['000.ref'][0])
print(tokenized_sents['000.ref'][1])
print(tokenized_sents['000.ref'][2])

['physical', 'diagnosis', 'had', 'its', 'origins', 'in', 'grecian', 'medicine']
['clinical', 'medicine', 'flourished', 'before', 'the', 'greeks', 'especially', 'in', 'egypt', 'crete', 'and', 'babylonia', 'and', 'undoubtedly', 'the', 'greeks', 'were', 'influenced', 'by', 'these', 'earlier', 'physicians']
['but', 'writings', 'from', 'these', 'countries', 'did', 'not', 'become', 'part', 'of', 'the', 'mainstream', 'of', 'western', 'civilization', 'as', 'did', 'those', 'of', 'the', 'greeks']


## Task 3: Remove stop words

As we discussed in the lecture, stop words do not contribute much to the lexical content. In most text analysis tasks (e.g., IR, text classification, topic modeling), we choose to remove all the stop words. Here, we start by counting the frequency of each unique word in the 16 reports in order to demonstrate that stop words actually dominate the text. The NLTK class that we are going to use is <b><a href="http://www.nltk.org/api/nltk.html#nltk.probability.FreqDist">FreqDist</a></b> in <font color="blue">nltk.probability</font>. To use <b>FreqDist</b>, we need to concatenate the words in all the sentences of all the reports to form one long list of tokens. The function for concatenating all the words is provided as follows:

In [9]:
from nltk.probability import FreqDist

def word_concat(dsd):
    """
    concatenate all the words stored in the values of a given dictionary. Each value is a list
    of tokenized sentences.
    """
    all_words = []
    for value in dsd.values():
        for sent in value:
            all_words += sent
    print("tokens: {0}".format(len(all_words)))
    print("types: {0}".format(len(set(all_words))))
    return all_words

Now, write your code to find the 20 most common words in the code cell below.

In [10]:
#########please write the missing code below#######
freq_dist = FreqDist(word_concat(tokenized_sents))
freq_dist.most_common(20)

tokens: 61412
types: 6925


[('the', 4497),
 ('of', 2590),
 ('and', 1674),
 ('a', 1436),
 ('to', 1416),
 ('in', 1397),
 ('is', 1023),
 ('with', 637),
 ('or', 636),
 ('be', 631),
 ('that', 535),
 ('patient', 529),
 ('for', 491),
 ('as', 479),
 ('by', 450),
 ('may', 361),
 ('are', 348),
 ('this', 335),
 ('disease', 325),
 ('it', 303)]

You should find that nearly all of the top 20 most common words are function words, except for "patient" and "disease". However, these two words actually appear in every clinical report, so we should also consider removing words of this type in the preprocessing. Here your task is to remove all the stop words in the following list:

In [11]:
# read one stop word per line; a set gives fast membership tests
with open('./stopwords_en.txt') as f:
    stopwords = set(f.read().splitlines())

In [12]:
def remove_words(words, stops):
    """
    This function excludes all the words appearing in a given list.
    Here the list is named "stops"
    """
    #########please write the missing code below#######
    return [word for word in words if word not in stops]

Write your code below. Your code should use the <font color="blue">remove_words</font> function defined above.

In [13]:
tokenized_sents_stop = {}
#########please write the missing code below#######
for key, value in tokenized_sents.items():
    tokenized_sents_stop[key] = [remove_words(sent, stopwords) for sent in value]

After removing the stop words, compare the most common words with those found before the removal.

In [14]:
#########please write the missing code below#######
freq_dist = FreqDist(word_concat(tokenized_sents_stop))
freq_dist.most_common(20)

tokens: 30395
types: 6506


[('patient', 529),
 ('disease', 325),
 ('patients', 255),
 ('pain', 225),
 ('history', 208),
 ('test', 202),
 ('symptoms', 155),
 ('physician', 141),
 ('heart', 141),
 ('pressure', 126),
 ('physical', 126),
 ('chest', 124),
 ('syncope', 124),
 ('clinical', 121),
 ('blood', 118),
 ('cardiac', 113),
 ('examination', 113),
 ('diagnostic', 111),
 ('medical', 110),
 ('exercise', 106)]
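The token and type counts printed before and after stop word removal show how much of the text the stop words account for. The short calculation below uses only the numbers reported in the two outputs above.

```python
# counts copied from the two word_concat outputs above
tokens_before, types_before = 61412, 6925
tokens_after, types_after = 30395, 6506

removed_tokens = tokens_before - tokens_after
removed_types = types_before - types_after
removed_share = removed_tokens / float(tokens_before)

print("removed tokens: {0} ({1:.1%} of all tokens)".format(removed_tokens, removed_share))
print("removed types: {0}".format(removed_types))
```

About half of all tokens (31017 of 61412) come from only 419 word types, which is why removing them shrinks the data so much while barely changing the vocabulary size.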

## Task 4: Generate a TF-IDF vector for each sentence
Now assume that we are also interested in computing the similarity between any two sentences in a vector space. Instead of using raw counts, we use TF-IDF vectors. In this task, you are going to use the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer">TfidfVectorizer</a> in the <a href="http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text">sklearn.feature_extraction.text</a> module. You should also read the <a href="http://scikit-learn.org/stable/modules/feature_extraction.html">Feature Extraction</a> tutorial on the sklearn website.
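To make the weighting concrete, here is a minimal pure-Python sketch of the formula that TfidfVectorizer applies with its default settings, as described in the scikit-learn documentation: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df(t))) + 1, and L2 normalisation of each row. The function name `tfidf_rows` and the toy sentences are hypothetical, and the sketch skips the vectorizer's own tokenization, so treat it as an illustration rather than a replacement.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    # document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log((1.0 + n) / (1.0 + d)) + 1.0 for w, d in df.items()}
    rows = []
    for doc in docs:
        raw = {w: count * idf[w] for w, count in Counter(doc).items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        # L2-normalise the row; an empty document yields an empty row
        rows.append({w: v / norm for w, v in raw.items()} if norm else {})
    return rows

# toy "sentences" (made up), already tokenized and stop-word filtered
docs = [["chest", "pain", "history"], ["chest", "examination"], ["history", "history"]]
rows = tfidf_rows(docs)

# for L2-normalised rows, cosine similarity is simply the dot product
sim = sum(weight * rows[1].get(word, 0.0) for word, weight in rows[0].items())
print(rows[0])
print(sim)
```

In the first toy sentence, "pain" receives a higher weight than "chest" because it occurs in fewer sentences, which is the effect the IDF term is meant to produce.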

In [15]:
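sents_ids = []
sents_words = []
for key, value in tokenized_sents_stop.items():
    for i, sent in enumerate(value):
        # the id combines the report name and the sentence index, e.g. "002.ref,0"
        sents_ids.append("{0},{1}".format(key, i))
        # TfidfVectorizer expects each document (here, each sentence) as one string
        sents_words.append(' '.join(sent))
sents_words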

['clinicians rate patient medical history greater diagnostic physical examination results laboratory investigations rich',
 'clinical adage thirds diagnoses made basis history retained validity technological advances modern hospital',
 'accurate history focus physical examination making productive time efficient',
 'clinical hypotheses generated interview provide basis cost effective utilization clinical laboratory diagnostic modalities',
 'diagnostic utility interview complemented therapeutic power',
 'medium positive relationship established doctor patient empathic patient centered interview bolster patient sense esteem lessen feelings helplessness accompany episode illness',
 'therapeutic alliance forged clinical encounter foundation ongoing patient care education',
 'student medical interview differs conversations special skills required',
 'sense direction distinguishes medical interview casual conversations social encounters',
 'fundamentally medical interview purposeful conversa

In [16]:
sents_ids

['002.ref,0',
 '002.ref,1',
 '002.ref,2',
 '002.ref,3',
 '002.ref,4',
 '002.ref,5',
 '002.ref,6',
 '002.ref,7',
 '002.ref,8',
 '002.ref,9',
 '002.ref,10',
 '002.ref,11',
 '002.ref,12',
 '002.ref,13',
 '002.ref,14',
 '002.ref,15',
 '002.ref,16',
 '002.ref,17',
 '002.ref,18',
 '002.ref,19',
 '002.ref,20',
 '002.ref,21',
 '002.ref,22',
 '002.ref,23',
 '002.ref,24',
 '002.ref,25',
 '002.ref,26',
 '002.ref,27',
 '002.ref,28',
 '002.ref,29',
 '002.ref,30',
 '002.ref,31',
 '002.ref,32',
 '002.ref,33',
 '002.ref,34',
 '002.ref,35',
 '002.ref,36',
 '002.ref,37',
 '002.ref,38',
 '002.ref,39',
 '002.ref,40',
 '002.ref,41',
 '002.ref,42',
 '002.ref,43',
 '002.ref,44',
 '002.ref,45',
 '002.ref,46',
 '002.ref,47',
 '002.ref,48',
 '002.ref,49',
 '002.ref,50',
 '002.ref,51',
 '002.ref,52',
 '002.ref,53',
 '002.ref,54',
 '002.ref,55',
 '002.ref,56',
 '002.ref,57',
 '002.ref,58',
 '002.ref,59',
 '002.ref,60',
 '002.ref,61',
 '002.ref,62',
 '002.ref,63',
 '002.ref,64',
 '002.ref,65',
 '002.ref,66',
 '002

Write your code below to generate the TF-IDF vectors.

In [17]:
#########please write the missing code below#######
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(input = 'content', analyzer = 'word')
tfidf_vectors = tfidf_vectorizer.fit_transform(sents_words)
tfidf_vectors.shape

(3281, 6506)

Now you should think about how to save the TF-IDF vector for each sentence in each report. Let's print out the sparse vector for the first sentence in "002.ref".

In [18]:
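# get_feature_names() was removed in scikit-learn 1.2; newer versions provide
# get_feature_names_out() instead
vocab = tfidf_vectorizer.get_feature_names()
for word, weight in zip(vocab, tfidf_vectors.toarray()[0]):
    if weight > 0:
        print("{0} : {1}".format(word, weight))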

clinicians : 0.316406000789
diagnostic : 0.225621475546
examination : 0.225160186953
greater : 0.33592441701
history : 0.196377281577
investigations : 0.375692150284
laboratory : 0.250824965944
medical : 0.229468751274
patient : 0.150956882278
physical : 0.220341630219
rate : 0.282126767625
results : 0.264622676996
rich : 0.411343020291
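Because TfidfVectorizer L2-normalises each row by default, the squared weights printed above should sum to one. The check below copies the printed values verbatim and computes their Euclidean norm.

```python
import math

# the 13 non-zero weights printed above for the first sentence of "002.ref"
weights = [0.316406000789, 0.225621475546, 0.225160186953, 0.33592441701,
           0.196377281577, 0.375692150284, 0.250824965944, 0.229468751274,
           0.150956882278, 0.220341630219, 0.282126767625, 0.264622676996,
           0.411343020291]

norm = math.sqrt(sum(w * w for w in weights))
print(round(norm, 6))  # 1.0
```

A unit norm means the dot product of two rows is already their cosine similarity, with no further scaling needed.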


## Task 5: Save the processed text
In this task, we need to save the TF-IDF vectors in a sparse format. There are different ways of generating the sparse format. For example, you can consider the data format used by <font color="red">tfidf_vectors</font>.

In [19]:
#########please write the missing code below#######
cx = tfidf_vectors.tocoo() # return the coordinate representation of a sparse matrix
with open("clinical.txt", 'w') as save_file:
    # each non-zero entry is written as "<file name>,<sentence index>:<word index>,<score>"
    for i, j, v in zip(cx.row, cx.col, cx.data):
        save_file.write(sents_ids[i] + ':' + str(j) + ',' + str(v) + '\n')
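Reading a line back is a quick way to confirm the layout written above, `<file name>,<sentence index>:<word index>,<score>`. The parser below is a hypothetical helper, and the sample line uses a made-up word index purely for illustration.

```python
def parse_line(line):
    """Parse '<file>,<sent>:<word>,<score>' into typed fields."""
    sent_id, _, rest = line.strip().partition(":")
    # split on the last comma so that the file name is kept intact
    file_name, sent_idx = sent_id.rsplit(",", 1)
    word_idx, score = rest.split(",")
    return file_name, int(sent_idx), int(word_idx), float(score)

# sample line: the word index 1021 is invented for this example
record = parse_line("002.ref,0:1021,0.316406000789\n")
print(record)
```

Parsing every line of the generated file this way and comparing the result with clinical_sample.txt would catch separator or ordering mistakes early.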

## Task 6: Save the final vocabulary
In this task, you should save the final vocabulary in a file.

In [20]:
# write one vocabulary word per line; the loop variable avoids shadowing the built-in "type"
with open("clinical_voc.txt", "w") as v_writer:
    for word in vocab:
        v_writer.write(word + "\n")