# Task 1 Reconstruct the Original Meeting Transcripts

#### Student Name: Vipul Krishnan Muralee Dharan
#### Student ID: 28104641

Date: 02/06/2018

Version: 1.0

Environment: Python 3.6.1 and Anaconda 4.3.21 (64-bit)

Libraries used:
* bs4 v4.6.0 (for XML retrieval)
* os (part of the Python standard library; for retrieving the file list in a folder)


## 1. Introduction
The original meeting transcripts are stored in three types of XML files, whose names end with ".words.xml", ".topic.xml" and ".segments.xml". (Details about the three types of files can be found in Section 3 below.) The task is to reconstruct the original meeting transcripts, with the corresponding topic and paragraph boundaries, from these files.

## 2. Importing Libraries

In [10]:
# importing libraries
from bs4 import BeautifulSoup as bsoup
import os

BeautifulSoup is used to easily retrieve data from the XML files (Crummy, 2018).

os is used to list the files inside the data folders (Python Software Foundation, 2018).

## 3. Defining Required Variables and Functions

We first define the path variables for convenience.

In [11]:
# defining path variables
topic_file_path = "./topics" 
word_file_path = "./words" 
segment_file_path = "./segments" 

Next, we define the main function, which takes the file name of a topic file as input and assembles the text for that file.

In [12]:
# the main function which integrates the text for a topic file and 
# saves it as a .txt file
def getTopics(tfile):
    
    # initializing the text content as empty
    text = ""
    
    # a dictionary to store the entire text related to the current topic file
    # key is the word file id
    vocD = {}
    
    # a dictionary to store the segment breaks
    # to be used to find the place where to insert new lines
    mapD = {}
    
    # opening the topic file in Beautiful Soup (the with block closes the file afterwards)
    with open(os.path.join(topic_file_path, tfile)) as tf:
        tSoup = bsoup(tf, 'lxml')
    
    # fetch all word files...
    for wfile in os.listdir(word_file_path):
        # .. which are related to the current topic file
        if wfile.split('.')[0] == tfile.split('.')[0]:
            # the key is the first two dot-separated parts of the file name
            wkey = '.'.join(wfile.split('.')[:2])
            # insert all text (found using the function getWords) into the vocD dictionary
            vocD[wkey] = getWords(wkey)
    
    # fetch all segment files..
    for sfile in os.listdir(segment_file_path):
        # .. which are related to the current topic file
        if sfile.split('.')[0] == tfile.split('.')[0]:
            # the key is the first two dot-separated parts of the file name
            skey = '.'.join(sfile.split('.')[:2])
            # find the beginning of the segments (using the function getMapping) and add to the segment dictionary
            mapD[skey] = getMapping(skey)
    
    # for each topic in the topic file..
    for topic in tSoup.find("nite:root").findAll('topic', recursive=False):
        # .. retrieve the text of the topic using the getTopic function and add it to the text variable
        # vocD and mapD are passed as arguments.
        text = text + " " + getTopic(topic, vocD, mapD).strip() + "\n"
    
    # once all the topics are completed, write the text variable into a txt file
    # (the with block closes the file, so the content is flushed to disk)
    with open("txt_files/" + tfile.split('.')[0] + ".txt", "w") as f:
        f.write(text)

The function above simply iterates through the topics in the topic file and calls the getTopic function to get the text of each topic. Once all the topic texts are collected, it writes them to a .txt file.

Please note that the words and segment breaks are loaded once into shared dictionaries, so the getTopic function does not have to reopen the files for every topic. This reduces the running time considerably; the whole run takes around 2 minutes.

We now define the getWords function.

In [13]:
# the function takes the word file reference as the parameter and returns all words in it
# the index of the word in the returned list is equal to its ID
def getWords(id):
    
    # list for words
    wList = []
    
    # missing word verifier
    # some of the word numbers are missing; this counter is used to pad the gaps
    verify = 0 
    
    # opens the file in bsoup (the with block closes the file afterwards)
    with open(os.path.join(word_file_path, id + ".words.xml")) as wf:
        WSoup = bsoup(wf, 'lxml')
    
    # for each word..
    for word in WSoup.find("nite:root").findAll():
        #.. word id is collected
        wordid = word['nite:id'].split('words')[-1]
        
        # if the wordid is not digits, give up this iteration
        # this is to make sure that the wordx tags are not considered
        if not wordid.isdigit():
            continue
            
        # missing word verification
        # if there is a mismatch between the id of the word and the maintained count..
        wid = int(wordid)
        if wid != verify:
            # .. add empty strings to the list so that index and id stay aligned
            for i in range(wid - verify): 
                wList.append("")
            verify = wid
        
        # if there is a word in the tag value, add that word
        if word.string is not None:
            wList.append(word.string.strip())
        # else append an empty string
        else:
            wList.append("")
        
        # verification count increases
        verify = verify + 1
        
    return wList

This function returns the word list corresponding to a word file, where the index of each word in the list equals the id of the word.
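The padding idea can be illustrated on its own, without any XML. The following is only a minimal sketch: the helper name `pad_by_id` and the toy `(id, word)` pairs are made up for illustration and are not part of the notebook.

```python
# Hypothetical helper (not part of the notebook): builds a list in which
# the index of every word equals its id, padding missing ids with "".
def pad_by_id(pairs):
    words = []
    expected = 0  # the id we expect to see next
    for wid, text in pairs:
        if wid != expected:
            # ids were skipped: fill the gap with empty strings
            words.extend([""] * (wid - expected))
            expected = wid
        words.append(text)
        expected += 1
    return words

# toy data: ids 2 and 3 are missing
print(pad_by_id([(0, "hello"), (1, "there"), (4, "world")]))
# ['hello', 'there', '', '', 'world']
```

Because of the padding, a lookup such as `words[4]` returns the word whose id is 4, even though two ids are missing before it.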

Now we define the getMapping function. This function obtains the details of the segment breaks, and we can use this data to insert line breaks.

In [14]:
# function returns a list containing the ids of the words which start a segment
def getMapping(id):
    
    # the return list
    retList = []
    
    # opening the file using bsoup (the with block closes the file afterwards)
    with open(os.path.join(segment_file_path, id + ".segments.xml")) as sf:
        SSoup = bsoup(sf, 'lxml')
    
    # for each segment in the file..
    for segment in SSoup.find("nite:root").findAll('segment'):
        #... find the starting word of the segment
        hrefArray = segment.find('nite:child')['href'].split('.')
        # add the id of the word to the list
        retList.append(int(hrefArray[5][5:-1]))
    
    # return the list
    return retList

The function simply finds the ids of the words with which the segments start and appends them to a list. This list is used when building the full text: whenever a word whose id is in this list is reached, a line break is inserted before it.
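The way such a list of segment starts turns into line breaks can be sketched on toy data. This is a simplified illustration and not the notebook's exact logic: the function name `join_with_breaks` and the sample words are assumptions. It also converts the list to a set, which makes each membership test cheap for long transcripts.

```python
# Hypothetical helper (not part of the notebook): joins words into lines,
# starting a new line at every word id listed in segment_starts.
def join_with_breaks(words, segment_starts):
    starts = set(segment_starts)  # set lookup instead of scanning a list
    lines, current = [], []
    for wid, word in enumerate(words):
        # a segment starts here: close the line collected so far
        if wid in starts and current:
            lines.append(" ".join(current))
            current = []
        # empty strings stand for missing or non-word ids and are skipped
        if word:
            current.append(word)
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)

print(join_with_breaks(["hello", "there", "", "so", "we", "start"], [0, 3]))
# hello there
# so we start
```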

Now we define the getTopic function.

In [15]:
# function obtains the complete text for a single topic
# it receives vocD and mapD from the calling function
# the isSub argument specifies whether the topic is a main topic or a subtopic
def getTopic(topic, vocD, mapD, isSub=0):
    
    # the text to be returned
    ttext = ""
    
    # find all the children of the topic tag non-recursively
    for y in topic.findAll(recursive=False):
        
        # if it is a subtopic, call the function recursively again with isSub=1
        if y.name == 'topic':
            ttext = ttext.strip() + "\n " + getTopic(y, vocD, mapD,1).strip() + '\n'
            
        # if it is a child element
        elif y.name == 'nite:child':
            # separate different parts in the href section of the element
            hrefArray = y['href'].split('.')
            
            # if the child element contains more than one word (i.e. if it is a range)
            if len(hrefArray) > 7:
                # key of the word file: first two dot-separated parts of the href
                key = hrefArray[0] + '.' + hrefArray[1]
                # for each word id in that range 
                for i in range(int(hrefArray[5][5:-1]), int(hrefArray[9][5:-1])+1):
                    # .. check if the word is the start of a segment; if yes, a new line char is inserted
                    if i in mapD[key]:
                        ttext = ttext.strip() + '\n'
                    # .. the word corresponding to the id is obtained from vocD and appended to ttext
                    if vocD[key][i] != "":
                        ttext = ttext.strip(" ") + " " + vocD[key][i].strip()
            
            # if there is only one word in the child element
            elif len(hrefArray) > 3:
                key = hrefArray[0] + '.' + hrefArray[1]
                wid = int(hrefArray[5][5:-1])
                # check if that word is the start of a segment; if yes, add a new line
                if wid in mapD[key]:
                    ttext = ttext.strip() + '\n'
                # find the word from the vocD dictionary using the id and append it
                if vocD[key][wid] != "":
                    ttext = ttext.strip(" ") + " " + vocD[key][wid]
            ttext = ttext.strip() + '\n'
    
    # if it is not a subtopic, add the topic boundary
    if isSub != 1:
        ttext = ttext.strip() + "\n**********\n"
    
    # return the text
    return ttext
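
The index arithmetic on `hrefArray` (positions 5 and 9, slice `[5:-1]`) is compact but hard to read. As a possible alternative, the href can be parsed with a regular expression. This is only a sketch: the href shape shown in the comment is an assumption inferred from the indices used above, and `parse_href` is a hypothetical helper that the notebook does not define.

```python
import re

# Assumed href shape, inferred from the index arithmetic in getTopic:
#   "ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words12)"
# A single-word href is assumed to stop after the first "id(...)".
HREF_RE = re.compile(
    r"^(?P<file>[^.]+\.[^.]+)\.words\.xml#id\([^)]*?words(?P<start>\d+)\)"
    r"(?:\.\.id\([^)]*?words(?P<end>\d+)\))?$"
)

def parse_href(href):
    """Return (file key, first word id, last word id), or None if no match."""
    m = HREF_RE.match(href)
    if m is None:
        return None
    start = int(m.group("start"))
    # a single-word href has no end part, so the range collapses to one id
    end = int(m.group("end")) if m.group("end") is not None else start
    return m.group("file"), start, end

print(parse_href("ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words12)"))
# ('ES2002a.B', 0, 12)
print(parse_href("ES2002a.B.words.xml#id(ES2002a.B.words5)"))
# ('ES2002a.B', 5, 5)
```

Returning `None` for an href that does not match leaves the decision to the caller, whereas the slicing approach raises an error on unexpected input.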

## 4. Calling the Function

To execute the task, we go through each topic file in the topic folder and call the getTopics function with its file name. This completes the whole task.

In [16]:
# for each file in the topic folder
for tfile in os.listdir(topic_file_path): 
    # full path, used only to check the file
    tpath = os.path.join(topic_file_path, tfile)
    # checking if the file is an xml file
    if os.path.isfile(tpath) and tpath.endswith('.xml'):
        # os.listdir already returns the bare file name, so it is passed on as is
        # calling the getTopics function
        getTopics(tfile)
        print(tfile + " : Done!")

ES2002a.topic.xml : Done!
ES2002b.topic.xml : Done!
ES2002c.topic.xml : Done!
ES2002d.topic.xml : Done!
ES2003a.topic.xml : Done!
ES2003b.topic.xml : Done!
ES2003c.topic.xml : Done!
ES2003d.topic.xml : Done!
ES2004a.topic.xml : Done!
ES2004b.topic.xml : Done!
ES2004c.topic.xml : Done!
ES2004d.topic.xml : Done!
ES2005a.topic.xml : Done!
ES2005b.topic.xml : Done!
ES2005c.topic.xml : Done!
ES2005d.topic.xml : Done!
ES2006a.topic.xml : Done!
ES2006b.topic.xml : Done!
ES2006d.topic.xml : Done!
ES2007a.topic.xml : Done!
ES2007b.topic.xml : Done!
ES2007c.topic.xml : Done!
ES2007d.topic.xml : Done!
ES2008a.topic.xml : Done!
ES2008b.topic.xml : Done!
ES2008c.topic.xml : Done!
ES2008d.topic.xml : Done!
ES2009a.topic.xml : Done!
ES2009b.topic.xml : Done!
ES2009c.topic.xml : Done!
ES2009d.topic.xml : Done!
ES2010a.topic.xml : Done!
ES2010b.topic.xml : Done!
ES2010c.topic.xml : Done!
ES2010d.topic.xml : Done!
ES2011a.topic.xml : Done!
ES2011b.topic.xml : Done!
ES2011c.topic.xml : Done!
ES2011d.topi

The whole task completes in 1 to 2 minutes depending on the machine.
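When a full path has to be reduced to a file name, splitting on a backslash only works on Windows. A portable way is `os.path.basename`. The sketch below is illustrative: `meeting_id` is a hypothetical helper name, and the path is built from the folder and file names that appear in this notebook.

```python
import os

# Hypothetical helper (not part of the notebook): extracts the meeting id,
# i.e. the part of the file name before the first dot, from a path.
def meeting_id(path):
    name = os.path.basename(path)  # works with the separator of the current OS
    return name.split('.')[0]

print(meeting_id(os.path.join("./topics", "ES2002a.topic.xml")))
# ES2002a
```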

## 5. Summary

The text for each topic file is assembled from the corresponding word files, segment breaks are inserted according to the segment files, and the result is saved as .txt files. The overall task takes 1 to 2 minutes, depending on the machine.

## References

- Crummy. (2018). "Beautiful Soup Documentation". Retrieved from https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Python Software Foundation. (2018). "Miscellaneous operating system interfaces". Retrieved from https://docs.python.org/3.4/library/os.html