# Task A: Parsing Raw Text Files
#### Name: Rohit Sanjay Tapas

Environment: Python 3 and Jupyter notebook

Libraries used: 
* re (for regular expressions, part of the Python standard library and included in Anaconda Python 3)
* json (for converting the temporary txt file to formatted JSON)


## 1. Introduction
This program extracts data from a semi-structured HTML file that contains unit information from the Monash unit handbook.
Each unit has the following data:
* Unit code
* Pre-requisite units for the unit
* Prohibitions (units that cannot be taken if a student is to study this unit)
* Synopsis of the unit
* Requirements for the unit
* Outcomes of the unit
* Chief examiners who will examine the unit

The re library is used to extract all the required elements from the HTML file.
The extracted elements are then written inside the appropriate XML and JSON tags.

The json library is then used to format and indent the json file appropriately.


## 2.  Import libraries 

In [44]:
import re
import json

## 3.  Open files
* my_file is the input data file; enter its name in the open call
* file is the output file that receives the XML data
* file1 is the output file that receives the temporary JSON data. The final JSON output file is created later.


In [45]:
my_file = open('29812135.txt','r')       #open data file
file = open("task1_29812135.xml","w")              #open file to write xml output
file1 = open("A2.txt","w")             #open file to write json output
my = my_file.read()
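
The handles above stay open until they are closed explicitly in section 6. As a minimal sketch of an alternative, the `with` statement closes each file automatically, even if an error occurs part-way through. The file names, the temporary directory and the unit code `ABC1234` are placeholders made up for illustration; they are not the files used in this notebook.

```python
import os
import tempfile

# Hypothetical file names inside a throwaway directory (illustration only).
work_dir = tempfile.mkdtemp()
input_path = os.path.join(work_dir, "sample_input.txt")
output_path = os.path.join(work_dir, "sample_output.xml")

# Create a small stand-in for the raw data file.
with open(input_path, "w", encoding="utf-8") as sample:
    sample.write('<div class="unit">ABC1234</div>')

# Read the whole input; the handle is closed when the block ends.
with open(input_path, "r", encoding="utf-8") as data_file:
    raw_text = data_file.read()

# Write an output file; again the handle is closed automatically.
with open(output_path, "w", encoding="utf-8") as xml_file:
    xml_file.write("<units>\n</units>")

print(data_file.closed, xml_file.closed)
```

With this pattern a separate close step is not needed for these handles.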

## 4.  Initialize lists
* Lists are initialized to store the extracted data

In [46]:
UnitTitle = []
UnitCode =[]
Synopsis = []
Examiner = []
ExaminerFinal = []
Prohibitions = []
Outcomes = []
OutcomesFinal = []
Req = []
ReqFinal = []
PreReq = []

## 5.  Cleaning data by using Regular Expressions

* The raw data is in HTML format, so the data we need is enclosed between tags. Regular expressions are used to extract it.
* A primary regex finds the entire block of data we need, and secondary regexes filter out the unwanted tags, leaving only the clean data.
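
The two-stage idea can be sketched on a tiny fragment. The HTML below is made up to imitate the handbook markup, and the names `primary`, `secondary`, `block` and `clean` are illustrative. The sketch also uses `search` and `group` instead of the `findall` plus `str` combination used in this notebook, so it is a simplified picture rather than the exact code path.

```python
import re

# Made-up fragment that imitates the handbook markup (not real data).
sample_html = (
    '<h2 class="hbk-heading">Synopsis</h2>\n'
    '<div><p>This unit introduces data wrangling.</p></div>'
)

# Primary regex: capture the block between the heading and the first </p>.
primary = re.compile(
    r'<h2 class="hbk-heading">Synopsis</h2>\n<div>(.*?)</p>', re.DOTALL
)
# Secondary regex: strip any leftover <div> and <p> tags.
secondary = re.compile(r'</?div>|</?p>')

block = primary.search(sample_html).group(1)
clean = secondary.sub('', block).strip()

print(block)  # <p>This unit introduces data wrangling.
print(clean)  # This unit introduces data wrangling.
```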

### 5.1 Extracting all the required elements from the semi structured data and writing them into files
The unit codes, titles, synopses, prerequisites, prohibitions, requirements, outcomes and chief examiners are extracted from the raw data and written to the output files.

In [47]:
main_filter = re.findall('<div class(.*?)<!-- /.content-inner__main --></div>',my,re.DOTALL|re.MULTILINE)     #RE to filter each unit chunk
UnitPatt = re.compile(r'e">[A-Z]{3,4}\d{4}')  #RE to find unit codes (3 or 4 letters followed by 4 digits)
FilterUnits = re.compile('[e][\"][>]')               #RE to filter unit codes
FilterTitle = re.compile('</span> - (.+)<span')      #RE to filter unit titles
FilterSyn1 = re.compile('<div>|<p>|</div>|</p>|\\\n|\a|\\|"]|"|]"')
FilterSyn2 = re.compile('<h2(.*?)<div>|<span(.*?)</span>')           #RE to filter unit synopsis
Exam1 = re.compile('<div>|<p>|</div>|</p>|\a|\(s\)|</a>')           
Exam2 = re.compile('<span(.*?)</span>|</a(.*?)">|<a(.*?)">|<p class(.*?)</p>|http(.*?).html|Researcher Profile')  #RE to filter chief examiners
Pro2 = re.compile('(<span(.*?).html">)|<span(.*?)</span>') #RE to filter prohibitions
Pro3 = re.compile('</span>|</a>|</>\\|<>|</p>|<p>')
OutFil1 = re.compile('<div>|\\|<p>|</div>|</p>|\\|\a|\(s\)|</a>|<li>|</li>|</ol>|</h2>|</span>|<ul>|</ul>|"')
OutFil2 = re.compile('<div>(.*?)1">|<ol(.*?)">|</ol>|<span(.*?)l">') #RE to filter outcomes
ReqFil = re.compile('<div>|</h2>|</div>|</p>|\n|\a|<li>|</li>|<ul>|</ul>|<a>|</a>|\n\n|<p>|<h2(.*?)</h2>|<ol(.*?)">|</ol>|<a class(.*?)">|<span class(.*?)</span>|"')
PReqFil = re.compile('(<span(.*?).html">)')  #RE to filter prerequisites
file1.write('{\n"units": {\n')
file1.write('"unit": [\n')
file.write("<units>\n")
m = len(main_filter)
p = 1
for each in main_filter:
        a = UnitPatt.findall(each)                        #Find all unit codes
        UnitTemp = FilterUnits.sub("",str(a))
        UnitTemp = UnitTemp.replace("['","")
        UnitTemp = UnitTemp.replace("']","")
        UnitCode.append(UnitTemp)
        file1.write('{\n"@id": "' + str(UnitTemp) + '",\n')  #write unit code to json file
        file.write('<unit id="' + str(UnitTemp) + '">\n')    #write unit code to xml file
        TitleTemp = FilterTitle.findall(each)                #find all unit titles
        UnitTitle.append(TitleTemp)
        file1.write('"title": "' + str(TitleTemp[0]) + '",\n')   #write unit title to json file
        file.write("<title>" + str(TitleTemp[0]) + "</title>\n")  #write unit title to xml file
        
        
        b = re.findall(r'<h2 class="hbk-heading">Synopsis<\/h2>\n<div>(.*?)</p>',each,re.DOTALL|re.MULTILINE)  #find synopsis
        SynTemp= FilterSyn2.sub('',str(b))  
        SynFinal= FilterSyn1.sub('',str(SynTemp))
        SynFinal = SynFinal.split('\\n')
        SynFinal = SynFinal[1].replace("['","")
        SynFinal = SynFinal.replace("']","")
        SynFinal = SynFinal.replace('"]',"")
        Synopsis.append(SynFinal)
        file1.write('"synopsis": "'+str(SynFinal)+ '",\n')             #write unit synopsis to json file
        file.write("<synopsis>\n" + str(SynFinal) + "\n</synopsis>\n") #write unit synopsis to xml file
        
        
        c = re.findall(r'heading">Prerequisites</p>(.*?)</p>\n</div>',each,re.DOTALL|re.MULTILINE)  #find all prerequisites
        PReqTemp = Pro2.sub('',str(c))
        PReqTempo = Pro3.sub("",str(PReqTemp))
        PReqFinal = re.findall(r"[A-Z]{3,4}\d{4}",PReqTempo)           #keep only the unit codes
        if len(PReqFinal) == 0:                                        #check if prerequisites present
            file1.write('"pre_requistics": "NA",\n')                   #if none then print NA
            file.write("<pre_requistics> NA </pre_requistics>\n")
            #print(PReqTempo)
        elif len(PReqFinal) == 1:                                      #if only one prerequisite print according to file structure
            file1.write('"pre_requistics": {\n')
            file1.write('"pre_requistic": "'+str(PReqFinal[0])+ '"\n')
            file1.write('},\n')
            file.write("<pre_requistics>\n")
            file.write("<pre_requistic>" + str(PReqFinal[0]) + "</pre_requistic>\n")
            file.write("</pre_requistics>\n")
            
        else:
            file1.write('"pre_requistics": {\n')                       #if multiple prerequisites print according to file structure
            file1.write('"pre_requistic": [\n')
            file.write("<pre_requistics>\n")
            last1 = PReqFinal[-1]                                      #last code is written without a trailing comma
            for eacha in PReqFinal[:-1]:                               #slice instead of pop so PreReq keeps every code
                file1.write('"'+str(eacha)+'",\n')
                file.write("<pre_requistic>" + str(eacha) + "</pre_requistic>\n")
            file.write("<pre_requistic>" + str(last1) + "</pre_requistic>\n")
            file1.write('"'+str(last1)+'"\n')
            file1.write(']\n},\n')
            file.write("</pre_requistics>\n")
        if len(PReqFinal)>0:
             PreReq.append(PReqFinal)
        else:
            PreReq.append(["NA"])
        
                
        d = re.findall(r'<p class="hbk-preamble-heading">Prohibitions(.*?)</div>',each,re.DOTALL|re.MULTILINE)  #find all prohibitions
        ProTemp = Pro2.sub('',str(d))
        ProTempo = Pro3.sub("",str(ProTemp))
        
        ProFinal = re.findall(r"[A-Z]{3,4}\d{4}",ProTempo)             #keep only the unit codes
        if len(ProFinal) == 0:                                                                 #check if prohibitions present
            file1.write('"prohibisions": "NA",\n')
            file.write("<prohibisions> NA </prohibisions>\n")                               #if no prohibitions then print NA
        elif len(ProFinal) == 1:                                                            #if one prohibition print according to file structure
            file1.write('"prohibisions": {\n')
            file1.write('"prohibision": "'+str(ProFinal[0]+ '"\n'))
            file1.write('},\n')
            file.write("<prohibisions>\n")
            file.write("<prohibision>" + str(ProFinal[0]) + "</prohibision>\n")
            file.write("</prohibisions>\n")
        else:                                                                               #if more than one prohibition print according to file structure
            file1.write('"prohibisions": {\n')
            file1.write('"prohibision": [\n')
            file.write("<prohibisions>\n")
            last = ProFinal[-1]                                        #last code is written without a trailing comma
            for eacha in ProFinal[:-1]:                                #slice instead of pop so Prohibitions keeps every code
                file1.write('"'+str(eacha)+'",\n')
                file.write("<prohibision>" + str(eacha) + "</prohibision>\n")
            file1.write('"'+str(last)+'"\n')
            file1.write(']\n},\n')
            file.write("<prohibision>" + str(last) + "</prohibision>\n")
            file.write("</prohibisions>\n")
        if len(ProFinal)>0:
             Prohibitions.append(ProFinal)
        else:
            Prohibitions.append(["NA"])

                
        
        f = re.findall(r'<h2 class="hbk-heading">Assessment(.*?)</div>',each,re.DOTALL|re.MULTILINE) #find all requirements
        ReqTemp = ReqFil.sub('',str(f))
        Req.append(ReqTemp)
        if len(ReqTemp) > 4:
            Req1 = ReqTemp.split('\\n')
            
            
        if len(ReqTemp) <= 4:                                         #if no requirements (only the text of an empty list is left) then print NA
            file1.write('"requirements": "NA",\n')
            file.write("<requirements> NA </requirements>\n")
             
        elif len(Req1) == 4:                                          #if one requirement print according to file structure
            file.write("<requirements>\n")
            file1.write('"requirements": {\n')
            file1.write('"requirement": ')
            for eacha in Req1:   
                if len(eacha)>3:
                    file.write("<requirement>" + str(eacha) + "</requirement>\n")
                    file1.write('"'+str(eacha)+'"')                            
            file1.write('\n},\n')
            file.write("</requirements>\n")
        
        else:                                                        #if more than one requirements print according to file structure
            file1.write('"requirements": {\n')
            file1.write('"requirement": [\n')
            
            file.write("<requirements>\n")
            last = Req1.pop(-2)
            for eacha in Req1:   
                if len(eacha)>3:
                    file1.write('"'+str(eacha)+'",\n')
                    file.write("<requirement>" + str(eacha) + "</requirement>\n")          
            file1.write('"'+str(last)+'"\n')
            file1.write(']\n},\n')
            file.write("<requirement>" + str(last) + "</requirement>\n")
            file.write("</requirements>\n")
        
        if len(ReqTemp) > 4:
            Req1 = ReqTemp.split('\\n')
            ReqFinal.append(Req1)
        else:
            ReqFinal.append(["NA"])
        
        
        
        e = re.findall(r'<h2 class="hbk-heading">Outcomes(.*?)</div>',each,re.DOTALL|re.MULTILINE)  #find all outcomes
        #print(a)
        OutTemp= OutFil2.sub('',str(e))
        OutFinal= OutFil1.sub('',str(OutTemp))
        #print(OutFinal)
        Outcomes.append(OutFinal)
        if len(OutFinal) > 4:                                  #split the cleaned text into individual outcomes
            Out1 = OutFinal.split('\\n')
        if len(OutFinal) <= 4:                                 #if no outcomes (only the text of an empty list is left) then print NA
            file1.write('"outcomes": "NA",\n')
            file.write("<outcomes> NA </outcomes>\n")
            
        else:
            file1.write('"outcomes": {\n')                    #if outcomes are present then print according to file structure
            file1.write('"outcome": [\n')
            file.write("<outcomes>\n")
            last = Out1.pop(-3)
            for eacha in Out1:   
                if len(eacha)>3:
                    file1.write('"'+str(eacha)+'",\n')
                    file.write("<outcome>\n" + str(eacha) + "\n</outcome>\n")          
            file1.write('"'+str(last)+'"\n')
            file1.write(']\n},\n')
            file.write("<outcome>\n" + str(last) + "\n</outcome>\n")
            file.write("</outcomes>\n")
         
        if len(OutFinal)>4:
            Out1 = OutFinal.split('\\n')
            OutcomesFinal.append(Out1)
        else:
            OutcomesFinal.append(["NA"])
        
        
                
        c = re.findall(r'<p class="hbk-highlight-heading">Chief examiner(.*?)<br/>',each,re.DOTALL|re.MULTILINE) #find all chief examiners
        ExamTemp= Exam2.sub('',str(c))
        ExamFinal = re.sub("<div>|<p>|</div>|</p>|\a|\(s\)|</a>|\\n","",ExamTemp)
        Examiner.append(ExamFinal)
        
        #print(ExamFinal)
        if len(ExamFinal) > 4:
            Exam1 = ExamFinal.split('\\n')
        if len(ExamFinal) == 0:                                             #if no chief examiners then print TBA
            file1.write('"chief_examiner": "TBA",\n')
            file.write("<chief_examiners> TBA </chief_examiners>\n")
            
        else:
            file1.write('"chief_examiners": {\n')                           #if chief examiners present then print according to the file structure
            file1.write('"chief_examiner": ')
            file.write("<chief_examiners>\n")
              
            if len(Exam1)>3:
                file1.write('"'+str(Exam1[2])+'"\n')
                file.write("<chief_examiner>" + str(Exam1[2]) + "</chief_examiner>\n")          
            if p < m:
                file1.write('}\n},\n')
            else:
                file1.write('}\n}\n')
            p = p+1
            file.write("</chief_examiners>\n")
            file.write("</unit>\n")
                
        if len(ExamFinal) > 4:
            Exam1 = ExamFinal.split('\\n')
            ExaminerFinal.append(Exam1)
        else:
            ExaminerFinal.append(["TBA"])
file1.write(']\n}\n}')
file.write("</units>")        



### Unclean data:

For example, the synopsis of a unit is enclosed between `<h2 class="hbk-heading">Synopsis</h2>` and `</div>`.
Hence we extract this chunk with our primary regex.

Next, with our secondary regexes, we remove the unwanted `<div>` and `<p>` tags.

### Cleaned data

The same approach has been used to extract all the required fields.
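
As a sketch of that approach applied to another field, the snippet below pulls unit codes out of a block and returns them in the "none, one, many" shape that the output files use. The helper `codes_or_na`, the constant `CODE_PATTERN` and the sample blocks with their unit codes are made up for illustration, and the function returns Python values rather than writing to the XML and JSON files.

```python
import re

# Unit codes: three or four capital letters followed by four digits.
CODE_PATTERN = re.compile(r'[A-Z]{3,4}\d{4}')

def codes_or_na(block):
    """Return 'NA', a single code, or a list of codes found in block."""
    codes = CODE_PATTERN.findall(block)
    if not codes:
        return "NA"          # no codes present
    if len(codes) == 1:
        return codes[0]      # exactly one code
    return codes             # several codes

# Made-up blocks (illustration only).
print(codes_or_na('<p>No prerequisites</p>'))
print(codes_or_na('<a href="x.html">ABC1234</a>'))
print(codes_or_na('<a>ABC1234</a> or <a>WXYZ5678</a>'))
```

Returning a value and serialising it in one place keeps the three cases in a single function instead of repeating the branching for every field.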

## 6. Close files

In [48]:
file.close()                                            #close xml file
file1.close()                                           #close json file

## 7. Clean JSON file
* Backslashes and semicolons are removed from the temporary JSON output file, because stray backslashes cannot be parsed by the json library

In [49]:
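file3 = open("A3.txt","w")                              #file for the cleaned json text
text = open("A2.txt",'r')                               #temporary json file written earlier
text1 = text.readlines()
text.close()                                            #close the temporary file after reading
abc = re.compile(r'\\|;')                               #RE to match backslashes and semicolons
for each in text1:
    each = abc.sub("",str(each))
    file3.write(each)
file3.close()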
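
To illustrate why this step is needed, the sketch below feeds a made-up line with a stray backslash to the json library before and after cleaning. The sample text and the names `raw_line`, `cleaner` and `clean_line` are assumptions for illustration; the pattern matches the same characters as the one in the cell above.

```python
import json
import re

# A hand-written JSON line with a stray backslash (made-up example).
raw_line = '{"synopsis": "Covers parsing\\; cleaning and storing data"}'

# A backslash followed by ';' is not a valid JSON escape, so parsing fails.
try:
    json.loads(raw_line)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
print("parses before cleaning:", parsed_ok)

# Drop backslashes and semicolons, then parse again.
cleaner = re.compile(r'\\|;')
clean_line = cleaner.sub('', raw_line)
print(json.loads(clean_line))
```

Note that the pattern also removes legitimate semicolons from the text, which is a trade-off of cleaning the file with a blanket substitution.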

## 8. Print clean structured JSON file
* The final JSON output file is generated with indentation

In [50]:
with open('task1_29812135.json','w') as outfile:                #printing extracted json file in clean json structure
    with open('A3.txt') as jsonfile:
        parsed = json.load(jsonfile)
        json.dump(parsed,outfile,indent=2)
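
The effect of the load and dump round trip can be seen on a small in-memory example. The JSON text below is made up, and `io.StringIO` stands in for the temporary file so the sketch runs without any files on disk.

```python
import io
import json

# Compact JSON text standing in for the temporary file (made-up content).
compact = '{"units": {"unit": [{"@id": "ABC1234", "title": "Sample unit"}]}}'

# json.load reads from any file-like object, here an in-memory buffer.
parsed = json.load(io.StringIO(compact))

# json.dumps returns the indented text instead of writing it to a file.
pretty = json.dumps(parsed, indent=2)
print(pretty)
```

Because the text is parsed before it is written back, this step also acts as a check that the cleaned file is valid JSON.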

## 9. Summary
* Using the re and json libraries, we have successfully extracted data from a semi-structured format and converted it into structured formats, i.e. XML and JSON