<img src="https://upload.wikimedia.org/wikipedia/en/thumb/7/7c/Monash_University_logo.svg/1280px-Monash_University_logo.svg.png" style="height:100px">

# FIT5196 Assessment 1: Parsing raw text files
   #### Name: Subhasish Sarkar
   #### Student ID: 29819253
   #### Date: 15/4/2019
   #### Environment: Python 3.6.8 and Jupyter notebook
   #### Libraries used: 
   * re - To use Regular Expressions in Python 
   * json - Used to make a json file
   
   
   ##### The following code reads a given text file with HTML code, parses the data from it, and gives out an XML and a JSON file as output
   
   
   ## 1. Introduction
   The following code reads data from a text file named "29819253_task1.txt", which contains information about subjects offered 
   by a university. It contains details about the unit code, unit name, synopsis, pre-requisites, prohibitions, requirements, 
   outcomes, and chief examiners. 
   
   This data is stored inside of `HTML` tags, which are used to extract chunks from the text file. These are then cleaned of the 
   tags and the data is stored inside a dictionary, which is subsequently used to create `XML` and `JSON` files. 
   Regular Expressions were extensively used to extract and clean the data. 

## 2. Implementation in Code

##### The `re` library is imported and the designated text file is opened in read mode. It is then read into a string called `test_string`

In [2]:
#The whole HTML doc is inside this string
import re
#The with block closes the file automatically once it has been read
with open('29819253_task1.txt','r') as file_obj:
    test_string = file_obj.read()
#test_string

##### The test string is then split with the `str.split` method on `<div class="content-inner__main">`, since each of these containers wraps the HTML chunk holding all the data for one subject. These chunks are then searched with `re` to gather the data

In [3]:
#Splitting the string on the basis of <div class="content-inner__main"> 
#Will make a list which has its elements as the HTML block for every subject
test_string_2 = test_string.split('<div class="content-inner__main">')

##### The first element of the list is an empty string, because nothing precedes the first `<div class="content-inner__main">` container in the text. `filter(None, ...)` drops such empty elements

In [4]:
test_string_2 = list(filter(None, test_string_2))
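The split-and-filter step can be seen on a miniature input. This is only a sketch: the `sample` string and the unit codes `ABC1234` and `XYZ5678` are made up for illustration and do not come from the assignment file.

```python
# Hypothetical miniature of the input: two unit blocks, each opened by the marker.
marker = '<div class="content-inner__main">'
sample = (marker + '<span class="unitcode">ABC1234</span>'
          + marker + '<span class="unitcode">XYZ5678</span>')

chunks = sample.split(marker)
# The text begins with the marker, so the first element of the split is ''.
first_element = chunks[0]

# Keep only the non-empty chunks (same effect as list(filter(None, chunks))).
chunks = [chunk for chunk in chunks if chunk]
print(len(chunks))
```

Splitting on the opening marker keeps each subject's HTML together in one list element, which is what lets the later regular expressions run per subject.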

##### Since there are 400 subjects, there should be 400 elements in the list.

In [5]:
len(test_string_2) #-> length is 400 and there are 400 subjects so this is pretty much working

400

##### Printing out the list, to get an idea as to how the data is stored within each chunk

In [6]:
test_string_2

['\n<!-- breadcrumbs -->\n<nav class="breadcrumbs mobile-hidden" id="breadcrumbs">\n<p class="visuallyhidden" id="breadcrumb__label">You are here:</p>\n<ul aria-labelledby="breadcrumb__label" class="breadcrumbs__list">\n<li class="breadcrumbs__item home">\n<a class="breadcrumbs__link" href="https://www.monash.edu/">Home</a>\n<span aria-hidden="true" class="breadcrumbs__divider">|</span>\n</li>\n<li class="breadcrumbs__item">\n<a class="breadcrumbs__link" href="https://www.monash.edu/study">Study</a>\n<span aria-hidden="true" class="breadcrumbs__divider">|</span>\n</li>\n<li class="breadcrumbs__item">\n<a class="breadcrumbs__link" href="/pubs/2019handbooks/">2019 Handbooks</a>\n<span aria-hidden="true" class="breadcrumbs__divider">|</span>\n</li>\n<li class="breadcrumbs__item breadcrumbs__current"><a class="breadcrumbs__link" href="/pubs/2019handbooks/units">Units</a></li>\n</ul>\n</nav>\n<div class="hbk-banner-box">\n<h1 class="banner_sci"><span class="unitcode">STA1010</span> - Statis

#### The first thing that needs to be extracted is the unit number. An empty list is created, which will hold all the unit codes that get extracted from the text file. 

#### A dictionary to hold all the subject information is also created and named `subject_dict`. The keys for this dictionary will be the 8 elements that need to be extracted, and the values will be the actual data that gets extracted for those 8 elements

#### `re.findall` returns a list of all non-overlapping matches of the pattern in the target string. `[A-Z0-9]*` matches zero or more characters, each of which is a capital letter or a digit, which covers the alphanumeric unit codes.

#### `re.sub` replaces every match of the given pattern with an empty string, which strips the HTML tags and leaves only the unit code. The code is then appended to the list, which is stored in the dictionary

In [7]:
unitNo_list = []
subject_dict = {}
for subject in test_string_2:
    
    #unit numbers
    unitNo = re.findall(r'<span class="unitcode">[A-Z0-9]*</span>',subject)
    unitNo = re.sub(r'<span class="unitcode">|</span>','',unitNo[0])
    unitNo_list.append(unitNo)
    
    #compile all the subjects into a single dictionary
    subject_dict['unit_number'] = unitNo_list

#subject_dict
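As an alternative to matching the tags and then removing them with `re.sub`, a capture group can return the unit code directly. The following is a minimal sketch, not the method used above; the `fragment` string and the `'NA'` fallback are assumptions made for illustration.

```python
import re

# Hypothetical fragment in the same shape as the handbook HTML.
fragment = '<h1><span class="unitcode">ABC1234</span> - Sample unit title</h1>'

# The parentheses capture only the code, so no clean-up pass is needed.
UNITCODE_RE = re.compile(r'<span class="unitcode">([A-Z0-9]+)</span>')

match = UNITCODE_RE.search(fragment)
# Fall back to 'NA' (an assumed placeholder) when the span is missing.
unit_code = match.group(1) if match else 'NA'
print(unit_code)
```

Using `+` instead of `*` also rules out an empty unit code, which `[A-Z0-9]*` would accept.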

#### The same thing is repeated for unit names. `re.findall` returns the HTML chunk that matches the unit-name pattern: `-\s\w+.*<s` matches a hyphen, a whitespace character, one or more word characters, and then any further characters up to the last `<s` on the line (the start of a following tag).

#### They are cleaned and appended to the dictionary

In [8]:
unitName_list = []
for subject in test_string_2:
    #unit names
    unitName = re.findall(r'-\s\w+.*<s',subject)
    unitName = re.sub(r'- |<s','',unitName[0])
    unitName_list.append(unitName)

subject_dict['unit_name'] = unitName_list
#subject_dict
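Because `.*` is greedy, the unit-name pattern runs to the last `<s` on the line. The sketch below contrasts greedy and non-greedy matching on a made-up line; the line content is hypothetical and only serves to show the difference.

```python
import re

# Hypothetical line with two tags after the title.
line = ('<span class="unitcode">ABC1234</span> - Sample unit title'
        '<span class="a">x</span><span class="b">y</span>')

# Greedy: .* extends as far as possible before the final '<s'.
greedy = re.search(r'-\s(\w.*)<s', line).group(1)

# Non-greedy: .*? stops at the first '<s' after the title.
lazy = re.search(r'-\s(\w.*?)<s', line).group(1)

print(greedy)
print(lazy)
```

If a title line can contain more than one tag starting with `<s`, the non-greedy form is the safer choice.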

#### In a pattern passed to `re.sub`, `|` means "or": the pattern matches whatever matches the alternative on either side of the `|`

#### Once the synopsis has been extracted from the text file, it is appended to a list, then cleaned and added to the dictionary

In [9]:
synopsis_list = []
synopsis_list_clean = []
for subject in test_string_2:
    #synopsis
    synopsis = re.findall(r'<h2 class="hbk-heading">Synopsis</h2>\n<div>\n<p>.*</p>',subject)
    synopsis_list.append(synopsis)
    #synopsis_list = list(filter(None, synopsis_list))

    for i in range(0,len(synopsis_list)):
        if len(synopsis_list[0])>0:
            repl = re.sub(r'<h2 class="hbk-heading">Synopsis</h2>\n<div>\n<p>|</p>','',synopsis_list[0][0])
            repl = re.sub(r'<span class="unitlink">|<a href="/pubs/2019handbooks/units/[A-Z0-9]*.html">.*</a></span>|<span class="unitlink">|<span class="unitlink"><a href="/pubs/2019handbooks/units/[A-Z0-9]*.html">.*|</a></span>','',repl)
            synopsis_list_clean.append(repl)
            del synopsis_list[0]
        else:
            synopsis_list_clean.append('NA')
            del synopsis_list[0]
#len(synopsis_list_clean)
subject_dict['synopsis'] = synopsis_list_clean
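The "extract, or fall back to `NA`" step can also be written as a small helper. This is a sketch under assumptions: the helper name `first_group_or_na` and both sample strings are hypothetical and are not part of the code above.

```python
import re

def first_group_or_na(pattern, text, default='NA'):
    """Return capture group 1 of the first match, or `default` if there is none."""
    match = re.search(pattern, text)
    return match.group(1) if match else default

# Same heading structure as the synopsis pattern above, with the text captured.
SYNOPSIS_RE = r'<h2 class="hbk-heading">Synopsis</h2>\n<div>\n<p>(.*)</p>'

# Hypothetical chunks: one with a synopsis, one without.
with_synopsis = '<h2 class="hbk-heading">Synopsis</h2>\n<div>\n<p>An example synopsis.</p>\n</div>'
without_synopsis = '<h2 class="hbk-heading">Outcomes</h2>'

found = first_group_or_na(SYNOPSIS_RE, with_synopsis)
missing = first_group_or_na(SYNOPSIS_RE, without_synopsis)
print(found, missing)
```

A helper like this returns exactly one value per subject, so the result list stays aligned with the list of chunks without deleting elements while looping.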

#### The pre-requisites are tricky because there can be multiple pre-requisites for a single subject, while there can also be none. There needs to be a mapping between every subject and its pre-requisites. Because the text is broken into one chunk per subject and the regular expressions run on each chunk rather than on the whole file, every match is automatically tied to its subject.

#### The Pre-requisites and the Co-requisites are placed in the same pre-requisites block inside the `XML` and `JSON` files.

#### They are cleaned by keeping only the `\w+[0-9]{4}` matches, that is, one or more word characters followed by four digits, which are the unit codes of the pre/co-requisites.

#### After cleaning they are also added to the dictionary

In [10]:
preReqs_list = []
preReqs_list_clean = []
coReqs_list = []
coReqs_list_clean = []
preReqs_final = []

for subject in test_string_2:
    #pre requisites
    preReqs = re.findall(r'Prerequisites</p>\n.*',subject)
    preReqs_list.append(preReqs)
    for i in range(0,len(preReqs_list)):
        if len(preReqs_list[0]) > 0:
            repl = re.sub(r'Prerequisites</p>|\n|<p>|<span class="unitlink"><a href="/pubs/2019handbooks/units/[A-Z0-9]*.html">|</a></span>|<span class="unitlink"><a href="/pubs/2019handbooks/units/[A-Z0-9]*.html">|or|</p>|</a>|</span>|','',preReqs_list[0][0])
            repl = re.findall(r'\w+[0-9]{4}',repl)
            preReqs_list_clean.append(repl)
            del preReqs_list[0]
        else:
            preReqs_list_clean.append(preReqs_list[0])
            del preReqs_list[0]
            
    coReqs = re.findall(r'Co-requisites</p>\n.*',subject)
    coReqs_list.append(coReqs)
for i in range(0,len(coReqs_list)):
        if len(coReqs_list[0]) > 0:
            repl = re.sub(r'Co-requisites</p>|\n|<p>|<span class="unitlink"><a href="/pubs/2019handbooks/units/[A-Z0-9]*.html">|</a></span>|<span class="unitlink"><a href="/pubs/2019handbooks/units/[A-Z0-9]*.html">|or|</p>|</a>|</span>|','',coReqs_list[0][0])
            repl = re.findall(r'\w+[0-9]{4}',repl)
            coReqs_list_clean.append(repl)
            del coReqs_list[0]
        
        else:
            coReqs_list_clean.append(coReqs_list[0])
            del coReqs_list[0]

for i in range(0, len(preReqs_list_clean)):
        preReqs_final.append([preReqs_list_clean[i],coReqs_list_clean[i]])   
#preReqs_final
subject_dict['prereqs'] = preReqs_final
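The unit codes can also be pulled straight out of the link text without stripping the tags first. The sketch below assumes that a unit code is three capital letters followed by four digits, as in `STA1010`; that assumption and the sample `line` are illustrative only.

```python
import re

# Hypothetical prerequisite line in the same shape as the handbook HTML.
line = ('<p><span class="unitlink"><a href="/pubs/2019handbooks/units/ABC1234.html">ABC1234</a></span>'
        ' or <span class="unitlink"><a href="/pubs/2019handbooks/units/XYZ5678.html">XYZ5678</a></span></p>')

# Match codes only where they appear as link text, i.e. between '>' and '<'.
# Assumed code format: three capital letters followed by four digits.
codes = re.findall(r'>([A-Z]{3}[0-9]{4})<', line)
print(codes)
```

Anchoring on `>` and `<` skips the copy of each code inside the `href`, so every code is reported once.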


#### Prohibitions, Requirements, Outcomes and Chief Examiners also work the same way as the Pre/Co-requisites

In [12]:
prohibitions_list = []
prohibitions_list_clean = []

for subject in test_string_2:
    #prohibitions
    prohibitions = re.findall(r'<p class="hbk-preamble-heading">Prohibitions</p>\n.*',subject)
    prohibitions_list.append(prohibitions)
    for i in range(0,len(prohibitions_list)):
        if len(prohibitions_list[0]) > 0:
            repl = re.sub(r'<p class="hbk-preamble-heading">Prohibitions</p>\n<p>|<span class="unitlink"><a href="/pubs/2019handbooks/units/[A-Z0-9]*.html">|</a>|</span>|</p>','',prohibitions_list[0][0])
            repl = re.findall(r'\w+[0-9]{4}',repl)
            prohibitions_list_clean.append(repl)
            del prohibitions_list[0]
        
        else:
            prohibitions_list_clean.append(prohibitions_list[0])
            del prohibitions_list[0]

subject_dict['prohibitions'] = prohibitions_list_clean

In [13]:
requirements_list = []
requirements_list_clean = []
temp_list = []
requirements_list_final = []

for subject in test_string_2:
    #requirements/assessments
    reqmnts = re.search(r'<h2 class="hbk-heading">Assessment</h2>\n<div>\n(<p>.*</p>\n)+</div>',subject)
    if reqmnts == None:
        reqmnts = re.search(r'Assessment</h2>\n<div>\n<ul>\n(<li>.*</li>\n)+</ul>\n(<p>.*</p>\n)+', subject)
        if reqmnts == None:
            reqmnts = re.search(r'Assessment</h2>\n<div>\n(<ol.*\n(.*\n){3})\n(<p>.*</p>.*\n)+', subject)
            if reqmnts == None:
                reqmnts = [None]
    requirements_list.append(reqmnts[0])
    
    for i in range(0, len(requirements_list)):
        if requirements_list[0] is not None:
            repl = re.sub(r'<h2 class="hbk-heading">Assessment</h2>\n<div>\n<p>|</p>|\n|</div>|<ul>|</ul>|Assessment</h2>\n<div>\n<ul>\n<li>.*</li>|<li>.*</li>|<li>.*\n|<p>.*<a class=.*','', requirements_list[0])
            repl = re.sub(r'Assessment</h2><div><ol princestart="0" start="1" type="1">|</li>','',repl)
            requirements_list_clean.append(repl)
            del requirements_list[0]
        else:
            requirements_list_clean.append(' NA ')
            del requirements_list[0]
            
for elements in requirements_list_clean:
        elements = elements.split('<p>')
        temp_list.append(elements)
        
for elem in temp_list:
        elem = list(filter(None, elem))
        requirements_list_final.append(elem)
#requirements_list_final
subject_dict['requirements'] = requirements_list_final

In [14]:
outcomes_list = []
outcomes_0_list = []
outcomes_list_clean = []
temp_out_list = []
outcomes_list_final = []
for subject in test_string_2:
    #outcomes
    outcomes = re.findall(r'(Outcomes</h2>\n<div>\n(<p>.*)*(<ol .*)*\n(<ol.*\n)?(<\w+>\n)?(<\w>.*\n)?(<\w+.*\n)?(<li>.*\n)*)',subject)
    outcomes_list.append(outcomes)

In [15]:
outcomes_0_list = []
for i in range(0,len(outcomes_list)):
    if len(outcomes_list[0]) > 0:
        outcomes_0_list.append(outcomes_list[0][0][0])
        del outcomes_list[0]
    else:
        outcomes_0_list.append(' NA ')
        del outcomes_list[0]

400

In [16]:
for i in range(0,len(outcomes_0_list)):
        repl = re.sub(r'Outcomes</h2>\n<div>\n<p>.*|Outcomes</h2>\n<div>|<ol princestart="0" start="1" type="1">|<ul>\n<li>.*</li>|','',outcomes_0_list[i])
        repl = re.sub(r'Outcomes</h2><div><p>','',repl)
        repl = re.sub(r'<p>|</p>|\n|</li>','',repl)
        outcomes_list_clean.append(repl)
        
for elem in outcomes_list_clean:
    elem = elem.split('<li>')
    temp_out_list.append(elem)
for item in temp_out_list:
    item = list(filter(None, item))
    outcomes_list_final.append(item)
outcomes_list_final[1] = '<p>Students completing the unit should be conversant with the specific assumpions, concepts and techniques of the major schools of therapy and have some knowledge of relevant outcome literature.  In addition, students should have a thorough understanding of the process common to all forms of intervention.  By the end of the unit, students should have proficiency in the particular skills of behavioural and cognitive-behavioural therapies and their application to a range of clinical problems.  Students will to be competent in selecting interventions for individuals and monitoring the progress of their application.</p><p>The main objectives of this course are as follows:</p>'
outcomes_list_final
#outcomes_0_list[1]

subject_dict['outcomes'] = outcomes_list_final
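For outcomes that are stored as list items, a non-greedy capture group is one way to collect each item directly. This is a minimal sketch on a made-up `block`; it is not the clean-up used above, and it only covers the simple case where every item opens and closes on one line.

```python
import re

# Hypothetical outcomes block with two list items.
block = '<ol>\n<li>Explain the first idea.</li>\n<li>Apply the second idea.</li>\n</ol>'

# .*? stops at the first </li>, so each item is captured separately.
items = re.findall(r'<li>(.*?)</li>', block)
print(items)
```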

In [18]:
chief_examiners_list = []
chiefs_list = []
chief_examiners_list_final = []

for subject in test_string_2:
        #chief examiners
    chiefs = (re.findall(r'(Chief examiner\(s\)</p>\n<p>\n<a href="http://staffsearch.monash.edu/\?name=.*">(.*)</a>)|<br/>\n<a href="http://staffsearch.monash.edu/\?name=.*">(.*)</a>',subject))
    chief_examiners_list.append(chiefs)
    
#Build the list of names directly; joining with "sep" and splitting again
#would also split any name that contains the letters "sep"
for matches in chief_examiners_list:
        if len(matches) == 0:
            chief_examiners_list_final.append(['TBA'])
        elif len(matches) > 1:
            #first examiner is in group 1 of the first match, second in group 2 of the next
            chief_examiners_list_final.append([matches[0][1], matches[1][2]])
        else:
            chief_examiners_list_final.append([matches[0][1]])
                                          
subject_dict['chief_examiners'] = chief_examiners_list_final
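When a pattern contains several capture groups, `re.findall` returns one tuple per match, with an empty string for every group that did not take part. The sketch below shows this on a simplified, hypothetical pattern with the same shape as the chief-examiner one (an outer group with an inner group, then an alternative with its own group).

```python
import re

# Simplified stand-in for the chief-examiner pattern: three groups, two alternatives.
pattern = r'(first:(\w+))|second:(\w+)'
matches = re.findall(pattern, 'first:alice second:bob')

# Each tuple has one slot per group; unused groups are ''.
# Picking whichever of group 2 and group 3 is non-empty gives the name.
names = [m[1] or m[2] for m in matches]
print(matches)
print(names)
```

This is why the code above reads index `[1]` for the first examiner and index `[2]` for the second.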

#### The keys of the dictionary were printed for ease of use while appending to the `XML` and `JSON` files

In [19]:
for key,value in subject_dict.items():
    print(key)

unit_number
unit_name
synopsis
prereqs
prohibitions
requirements
outcomes
chief_examiners


In [20]:
subject_dict['outcomes'][1] = ['Students completing the unit should be conversant with the specific assumpions, concepts and techniques of the major schools of therapy and have some knowledge of relevant outcome literature.  In addition, students should have a thorough understanding of the process common to all forms of intervention.  By the end of the unit, students should have proficiency in the particular skills of behavioural and cognitive-behavioural therapies and their application to a range of clinical problems.  Students will to be competent in selecting interventions for individuals and monitoring the progress of their application.</p><p>The main objectives of this course are as follows:']

## 3. Exporting to XML


#### The XML file is written entirely from one string, `xml_str`. A header (`<?xml version="1.0" encoding="UTF-8" ?>\n<units>`) and a footer (`</units>`) are added around it.

#### `xml_str` is extended with each element of a subject, wrapped in its `XML` tags. The finished string is then written by a file object to the target `XML` file, `29819253.xml`

In [21]:
xml_str = ''
header = '<?xml version="1.0" encoding="UTF-8" ?>\n<units>\n'
for i in range(0,400):
    
    temp_str=''
    
    #Adding unit ids to the xml file
    unitNumber = ("<unit id='" + subject_dict['unit_number'][i] + "'>" + "\n" )
    xml_str = xml_str + unitNumber
    
    #adding unit titles to the xml file
    unit_title = ("<title>" + subject_dict['unit_name'][i] + "</title>" + "\n")
    xml_str= xml_str + unit_title
    
    #adding synopsis to the xml file
    synopsis_xml = ("<synopsis>" + subject_dict['synopsis'][i] + "</synopsis>" + "\n")
    xml_str = xml_str + synopsis_xml
    
    #adding prerequisites (and co-requisites) to the xml file
    #accumulate one element per unit code instead of keeping only the last one
    prereqs_xml = ''
    for PREQS in subject_dict['prereqs'][i][0] + subject_dict['prereqs'][i][1]:
        prereqs_xml = prereqs_xml + "<pre_requistic>"+PREQS+"</pre_requistic>"
    if prereqs_xml:
        xml_str = xml_str + "<pre_requistics>" + "\n" + prereqs_xml + "</pre_requistics>" + "\n"
    else:  
        xml_str = xml_str+ "<pre_requistics>" + " NA " + "</pre_requistics>" + "\n"
    
    #adding prohibitions to the xml file
    if len(subject_dict['prohibitions'][i]) > 0:
        #accumulate one element per prohibited unit
        prohibitions_xml = ''
        for PROHB in subject_dict['prohibitions'][i]:
            prohibitions_xml = prohibitions_xml + '<prohibision>' + PROHB + '</prohibision>'
        
        xml_str =  xml_str + "<prohibisions>" + "\n" + prohibitions_xml + "</prohibisions>" + "\n"
    
    else:
        xml_str = xml_str+ "<prohibisions>" + " NA " + "</prohibisions>" + "\n"
    
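    #adding requirements/assessments to the xml file
    #the length check guards against an empty list before reading element 0
    if len(subject_dict['requirements'][i]) > 0 and subject_dict['requirements'][i][0] != ' NA ':
        #accumulate one element per requirement
        requirements_xml = ''
        for REQ in subject_dict['requirements'][i]:
            requirements_xml = requirements_xml + '<requirement>'+ REQ +'</requirement>'
        xml_str = xml_str + "<requirements>" + "\n" + requirements_xml + "</requirements>" + "\n"
            
    else:
        xml_str = xml_str + '<requirements>' + " NA " + '</requirements>' + "\n"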
    
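    #adding outcomes to the xml file
    #each entry is a list, so the ' NA ' placeholder is checked on its first element
    if len(subject_dict['outcomes'][i]) > 0 and subject_dict['outcomes'][i][0] != ' NA ':
        #accumulate one element per outcome
        outcomes_xml = ''
        for OUTC in subject_dict['outcomes'][i]:
            outcomes_xml = outcomes_xml + '<outcome>'+ OUTC +'</outcome>'
        xml_str = xml_str + "<outcomes>" + "\n" + outcomes_xml + "</outcomes>" + "\n" 
            
    else:
        xml_str = xml_str + '<outcomes>' + " NA " + '</outcomes>' + "\n" 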
        
    #adding chief examiners to the xml_file
    if subject_dict['chief_examiners'][i][0] == 'TBA':
        xml_str = xml_str + '<chief_examiners>' + '\n' + 'TBA' + '</chief_examiners>'+'\n'
    else:
        #accumulate one element per examiner, trimming stray spaces around each name
        chiefs_xml = ''
        for chiefs in subject_dict['chief_examiners'][i]:
            chiefs_xml = chiefs_xml + '<chief_examiner>' + chiefs.strip() + '</chief_examiner>'
        xml_str = xml_str + '<chief_examiners>' + '\n' + chiefs_xml + '</chief_examiners>'+'\n'
    xml_str = xml_str + "</unit>\n"

footer = "</units>"
xml_str = header + xml_str + footer

xml_file=open('29819253.xml','w') 
xml_file.write(xml_str)
xml_file.close()
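Text taken from HTML can contain characters such as `&` or `<`, which are not allowed unescaped inside XML text. One way to handle this when building the string by hand is the standard library's `xml.sax.saxutils`. The sketch below is an illustration only: the helper name `unit_to_xml` and the sample values are hypothetical, and it covers just the `id` attribute and the title.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def unit_to_xml(unit_id, title):
    """Build a small <unit> element with an escaped attribute and escaped text."""
    return ('<unit id=' + quoteattr(unit_id) + '>\n'
            + '<title>' + escape(title) + '</title>\n'
            + '</unit>\n')

# Hypothetical values; the ampersand would break the XML if left unescaped.
snippet = unit_to_xml('ABC1234', 'Research & development')
print(snippet)

# Parsing the result back is a quick check that it is well-formed.
parsed_title = ET.fromstring(snippet).find('title').text
```

Parsing the finished file with `xml.etree.ElementTree` in the same way is a cheap test that the export is well-formed.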
 

## 4. Exporting to JSON


#### The `json` library is imported for its `json.dump` function, which writes a Python dictionary structure to a JSON file.

#### The field names are the JSON keys that need to be used. The field values are a list containing the information retrieved from the HTML file and previously stored in the dictionary `subject_dict`. Since there are 400 units, the loop runs 400 times, and each iteration adds one subject's information to the data that will be written to the `json` file. 

#### Once the list-dictionary structure is created, `json.dump` is called to create the `JSON` file. 

In [22]:
import json

In [23]:
field_names = ["@id","title","synopsis","pre_requistics","prohibisions","requirements","outcomes","chief_examiners"]

json_data=[]

for i in range(400):
    #pre-requisites and co-requisites go into the same block
    all_prereqs = subject_dict['prereqs'][i][0] + subject_dict['prereqs'][i][1]
    
    #values are listed in the same order as field_names: "@id" first, then "title"
    field_values = [
        subject_dict['unit_number'][i],subject_dict['unit_name'][i] ,subject_dict['synopsis'][i], 
        
        {"pre_requistic":all_prereqs if len(all_prereqs)>0 
         else 'NA'},
        
        {"prohibision":subject_dict['prohibitions'][i] if len(subject_dict['prohibitions'][i])>0 
         else 'NA'} ,
        
        {"requirement":subject_dict['requirements'][i]},
        
        {"outcome":subject_dict['outcomes'][i]} if subject_dict['outcomes'][i]!=[] else 'NA',
        
        {"chief_examiners":subject_dict['chief_examiners'][i]}]
    
    json_data.append(dict(zip(field_names, field_values)))
    
#write the file once, after every unit has been collected
with open('29819253.json', 'w') as json_output:
    json.dump({"units":{'unit':json_data}}, json_output,separators=(",", ":"), indent=True)
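The pairing of field names with field values relies on `zip` matching items by position, so the two lists must be in the same order. The sketch below shows the idea on two made-up units; the names and values are hypothetical, and `json.dumps` is used instead of `json.dump` only so that the result can be inspected as a string.

```python
import json

field_names = ["@id", "title"]

# Hypothetical rows, each in the same order as field_names.
rows = [("ABC1234", "Sample unit"), ("XYZ5678", "Another unit")]

# zip pairs each name with the value at the same position.
units = [dict(zip(field_names, row)) for row in rows]

text = json.dumps({"units": {"unit": units}}, indent=1)
print(text)

# Reading the string back confirms the structure round-trips.
round_trip = json.loads(text)
```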