# Task 2: Generate Sparse Representations

#### Student Name: Vipul Krishnan Muralee Dharan
#### Student ID: 28104641

Date: 02/06/2018

Version: 1.0

Environment: Python 3.6.1 and Anaconda 4.3.21 (64-bit)

Libraries used:
* bs4 v4.6.0 (for XML retrieval)
* os (included with Python; for retrieving the list of files in a folder)
* re v2.2.1 (regular expression operations)
* nltk v3.2.3 (for using the regular expression tokenizer)
* Counter (included in Python's collections module)


## 1. Introduction
The aim of this task is to build sparse representations of the meeting transcripts generated in Task 1. This involves word tokenization, vocabulary generation, and the generation of the sparse representations themselves.

## 2. Importing Libraries

In [11]:
from bs4 import BeautifulSoup as bsoup
import re
import os
from nltk.tokenize import RegexpTokenizer
from collections import Counter

BeautifulSoup is used for retrieving XML (Crummy, 2018). RegexpTokenizer and re are used for regular expression tokenization (NLTK, 2018). os is used for listing the files inside a folder (Python Software Foundation, 2018). Counter is used for counting the occurrences of each word in a paragraph (Python Software Foundation, 2018).

## 3. Generating Unigram Vocabulary File

We first store the contents of all the text files in a dictionary, so that we don't need to open the files multiple times. The key of the dictionary is the path of the file and the value is the content of the file.

In [12]:
# the path to text file folder
txt_file_path = "./txt_files"

# dict for storing the file contents
vocD = {}

# for each file
for fname in os.listdir(txt_file_path):
    # .. exclude the sample file and some unknown file in case if it is there
    # (compare file names, so the check does not depend on the path separator)
    if fname not in ['.DS_Store', 'example_output.txt']:
        tfile = os.path.join(txt_file_path, fname)
        # .. store the content of the file into the dict
        with open(tfile, 'r') as fh:
            vocD[tfile] = fh.read()

Now we collect all the words together in a single list.

In [13]:
# all word list
allWordList = []

# obtaining the stopwords from the given files
stopwords = open('./stopwords_en.txt','r').read()
stop_word_list = stopwords.split('\n')

# dictionary similar to vocD, but the value is an array of words
# defined for future use
vocListD = {}

# defining tokenizer
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")

# for each file texts (the keys in the dict represents different files)
for k in vocD.keys():
    # storing the text in a separate variable
    text = vocD[k]
    # making all lower case
    text = text.lower()
    # tokenizing 
    wList = tokenizer.tokenize(text)
    # removes stop words and words shorter than 3 characters
    wList = [w for w in wList if (w not in stop_word_list) and (len(w)>2)]
    vocListD[k] = wList
    # appending to the main list (extend avoids copying the whole list each time)
    allWordList.extend(wList)
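
To make the effect of the token pattern and the filters concrete, here is a minimal sketch on a made-up sentence. It uses `re.findall` with the same pattern, on the assumption that `RegexpTokenizer` with default settings returns the same non-overlapping matches. The helper name `tokenize_and_filter`, the sample sentence, and the two-word stop list are all hypothetical.

```python
import re

# same pattern as passed to RegexpTokenizer above
TOKEN_PATTERN = r"\w+(?:[-']\w+)?"

def tokenize_and_filter(text, stop_words):
    """Lower-case, tokenize, then drop stop words and tokens shorter than 3 characters."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    return [t for t in tokens if t not in stop_words and len(t) > 2]

# hypothetical sample sentence and stop list, for illustration only
sample = "It's a well-known remote-control idea, ok?"
stop_words = {"it's", "a"}

tokens = tokenize_and_filter(sample, stop_words)
print(tokens)
```

Hyphenated and apostrophised words stay as single tokens, while "ok" is dropped by the length filter rather than by the stop list.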

Now we make the words unique and apply the document frequency constraint.

In [14]:
# making the elements in the list unique
allWordList = list(set(allWordList))

finalList = []

# for each element in the list
for word in allWordList:
    count = 0
    # checks for document frequency constraint
    # finds the document frequency as count
    for k in vocD.keys():
        if word in vocListD[k]:
            count = count + 1
    # if it is less than 133, added to new list
    if count <= 132:
        finalList.append(word)
        
# this cell takes 1 to 1.5 minutes to run
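
Most of the running time above comes from testing membership in lists once per word and per document. As a possible alternative, the sketch below counts document frequency in a single pass by updating a `Counter` with the set of tokens of each document. The function name `document_frequency`, the toy documents, and the threshold of 1 are assumptions for illustration; the notebook itself uses a threshold of 132.

```python
from collections import Counter

def document_frequency(token_lists):
    """Count, for every word, the number of documents it appears in."""
    df = Counter()
    for tokens in token_lists:
        # set() ensures each document contributes at most 1 per word
        df.update(set(tokens))
    return df

# hypothetical toy corpus: file name -> list of tokens
docs = {
    "a.txt": ["remote", "control", "remote"],
    "b.txt": ["remote", "button"],
    "c.txt": ["button", "design"],
}

df = document_frequency(docs.values())

# keep only words whose document frequency does not exceed the threshold
max_df = 1  # illustrative value only
vocab = sorted(w for w, c in df.items() if c <= max_df)
print(vocab)
```

Here "remote" and "button" each occur in two documents, so only "control" and "design" survive the constraint.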

Now we sort the words alphabetically and write them to the file in the required format.

In [15]:
# sorting
sortedList = sorted(list(set(finalList)))

# opening file
f = open('vocab.txt','w')

# for each word index
for i in range(len(sortedList)):
    # write word and word index
    f.write(sortedList[i]+":"+str(i)+"\n")
f.close()
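
The `word:index` format written above can be read back into a dictionary, which gives constant-time lookups from a word to its index. The following is a small sketch of both directions; the helper names `format_vocab` and `parse_vocab` and the sample words are hypothetical and not part of the original notebook.

```python
def format_vocab(words):
    """Return 'word:index' lines for the sorted, de-duplicated words."""
    return [f"{w}:{i}" for i, w in enumerate(sorted(set(words)))]

def parse_vocab(lines):
    """Parse 'word:index' lines back into a word -> index dictionary."""
    vocab = {}
    for line in lines:
        # split on the last colon only, so the index is always the final field
        word, idx = line.strip().rsplit(":", 1)
        vocab[word] = int(idx)
    return vocab

lines = format_vocab(["remote", "button", "remote"])
word_to_id = parse_vocab(lines)
print(lines)
print(word_to_id)
```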

## 4. Generating Topic Boundary Encoded File

First we create a dictionary whose keys are the same as those of vocD (the file paths), and whose values are vectors marking the paragraph breaks and topic breaks in the specified format.

In [16]:
# initializing dictionary
vecD = {}

# for each entries in vocD
for k in vocD.keys():
    doc = vocD[k]
    vector = []
    # for each character in the text
    for l in range(len(doc)):
        # if it is a new line
        if doc[l] == '\n':
            # if it is the end of the text, append 0
            if l+1 == len(doc):
                vector.append('0')
            # else if it is a topic start, do nothing
            elif doc[l-1] == '*':
                pass
            # else if it is a topic end, add one
            elif doc[l+1] == '*':
                vector.append('1')
            # else append 0
            else:
                vector.append('0')
    # add the vector to vecD dict
    vecD[k] = vector
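
The same idea can be expressed line by line instead of character by character. The sketch below is an illustration under stated assumptions, not a drop-in replacement: it assumes that topics are separated by lines made of asterisks and that the text ends with a newline, and it may differ from the cell above on other inputs. The function name `boundary_vector` and the sample text are hypothetical.

```python
def boundary_vector(text):
    """Return '1' for a paragraph followed by a separator line, else '0'."""
    lines = text.split("\n")
    # a trailing newline leaves one empty final element; drop it
    if lines and lines[-1] == "":
        lines.pop()
    vector = []
    for i, line in enumerate(lines):
        # separator lines (assumed to end with '*') get no entry
        if line.endswith("*"):
            continue
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        # a paragraph directly before a separator closes a topic
        vector.append("1" if next_line.startswith("*") else "0")
    return vector

# hypothetical transcript: two paragraphs, a separator, one more paragraph
sample = "para one\npara two\n**********\npara three\n"
print(",".join(boundary_vector(sample)))
```

For this sample, the second paragraph is marked as a topic boundary and the separator line itself produces no entry.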

Now we write these vectors to the file.

In [17]:
# open the file
f = open('topic_segs.txt','w')

# for each entry in the vecD dict
for k in vecD.keys():
    # the meeting ID: the file name without its directory and extension
    # (os.path.basename does not depend on a hard-coded backslash separator)
    meeting_id = os.path.basename(k).split('.')[0]
    # joining with commas so that no comma is left at the end of the line
    toWrite = meeting_id + ":" + ",".join(vecD[k]) + "\n"
    # writing to file
    f.write(toWrite)
# closing the file
f.close()

## 5. Generating the Sparse Files

The code below generates the sparse files. It takes each file's text, splits it into paragraphs, and tokenizes each paragraph. Then, for each word, it checks whether the word exists in the vocabulary. If it does, the word's ID and its count in the paragraph are added to the output line.

Note: for this task, it is assumed that a "paragraph" in the question refers to a paragraph in the txt file, independent of the segment.xml file.

In [18]:
# for each text content stored in vocD
for k in vocD.keys():
    # initializing the main text for file
    text = ""
    transcript = vocD[k]
    
    # separates the paragraphs
    paraList = transcript.split("\n")
    for para in paraList:
        line = ""
        # separating each word in the paragraph
        wList = tokenizer.tokenize(para.lower())
        
        # finding the unique words and counts
        wordD = Counter(wList)
        
        # for each word in the current wordD keys
        for w in wordD.keys():
            # if the word is in vocabulary list
            if w.lower().strip() in sortedList:
                # add the word id and count in the vocab file
                line = line + str(sortedList.index(w.lower())) + ":" + str(wordD[w]) + ","
        # skip the empty lines
        if line == "":
            continue
        # adding the line to the main text of the file
        text = text + line[:-1] + "\n"
    # opening the file for the current txt file
    f = open('./sparse_files/'+os.path.basename(k),'w')
    # writing
    f.write(text)
    # closing the file
    f.close()
    print(k + " : Done!")
# the code takes 3 to 4 minutes depending on the machine

./txt_files\ES2002a.txt : Done!
./txt_files\ES2002b.txt : Done!
./txt_files\ES2002c.txt : Done!
./txt_files\ES2002d.txt : Done!
./txt_files\ES2003a.txt : Done!
./txt_files\ES2003b.txt : Done!
./txt_files\ES2003c.txt : Done!
./txt_files\ES2003d.txt : Done!
./txt_files\ES2004a.txt : Done!
./txt_files\ES2004b.txt : Done!
./txt_files\ES2004c.txt : Done!
./txt_files\ES2004d.txt : Done!
./txt_files\ES2005a.txt : Done!
./txt_files\ES2005b.txt : Done!
./txt_files\ES2005c.txt : Done!
./txt_files\ES2005d.txt : Done!
./txt_files\ES2006a.txt : Done!
./txt_files\ES2006b.txt : Done!
./txt_files\ES2006d.txt : Done!
./txt_files\ES2007a.txt : Done!
./txt_files\ES2007b.txt : Done!
./txt_files\ES2007c.txt : Done!
./txt_files\ES2007d.txt : Done!
./txt_files\ES2008a.txt : Done!
./txt_files\ES2008b.txt : Done!
./txt_files\ES2008c.txt : Done!
./txt_files\ES2008d.txt : Done!
./txt_files\ES2009a.txt : Done!
./txt_files\ES2009b.txt : Done!
./txt_files\ES2009c.txt : Done!
./txt_files\ES2009d.txt : Done!
./txt_fi
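
As an illustration of the encoding of a single paragraph, the sketch below builds one `id:count` line using a word-to-index dictionary, which avoids searching the vocabulary list for every word. It uses `re.findall` with the notebook's token pattern, and it sorts the pairs by word ID, which is an added choice for reproducible output rather than something the cell above does. The function name `sparse_line`, the three-word vocabulary, and the sample paragraph are hypothetical.

```python
import re
from collections import Counter

# same token pattern as used for the vocabulary
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)?")

def sparse_line(paragraph, word_to_id):
    """Encode one paragraph as comma-separated 'word_id:count' pairs."""
    counts = Counter(TOKEN_RE.findall(paragraph.lower()))
    # keep only words in the vocabulary; dict lookup is constant time
    pairs = sorted((word_to_id[w], c) for w, c in counts.items() if w in word_to_id)
    return ",".join(f"{i}:{c}" for i, c in pairs)

# hypothetical vocabulary and paragraph, for illustration only
word_to_id = {"button": 0, "control": 1, "remote": 2}
line = sparse_line("Remote control, remote button and the remote", word_to_id)
print(line)
```

A paragraph with no vocabulary words yields an empty string, which corresponds to the lines skipped in the cell above.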

## 6. Summary

This task generated the vocabulary file, the topic-boundary encoded file, and the sparse files from the txt files produced in the previous task.

## References

- Crummy. (2018). "Beautiful Soup Documentation". Retrieved from https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Python Software Foundation. (2018). "Miscellaneous operating system interfaces". Retrieved from https://docs.python.org/3.4/library/os.html
- Python Software Foundation. (2018). "collections — High-performance container datatypes". Retrieved from https://docs.python.org/2/library/collections.html
- NLTK. (2018). "NLTK 3.3 documentation". Retrieved from https://www.nltk.org/py-modindex.html