# Parsing Raw Text Files - Text Pre-processing
####  Name: Kshitij Patil
#### Email: kshitijpatilw@gmail.com

Date:

Environment: Python 3.x and Jupyter notebook
Libraries used: 
* re - Python's built-in module for pattern matching with regular expressions
* json - used for converting dict objects (or dict-like strings) to JSON text (json.dumps)



## 1. Introduction

We have been given web-scraped data in HTML format from the Monash website. The file is semi-structured and contains information on units (unit code, unit title, synopsis, requirements, outcomes, chief examiner) pertaining to a particular course. We need to gather this data into JavaScript Object Notation (`JSON`) and Extensible Markup Language (`XML`), two formats used for data serialization and interchange because they are easy for machines to parse (JSON in particular is used by almost all web-based APIs).

The following data needs to be extracted from the file:

1. The __code__ of the unit for a course. ( _3 or 4 letters followed by 4 digits_ )
2. The __Prerequisites__ associated with a course. ( _prerequisites and co-requisites_ )
3. The __Prohibitions__ associated with a course.
4. __Synopsis__ of the course.
5. __Requirements__ for the course.
6. __Outcomes__ after the course is done.
7. __Chief examiner__ associated with marking the course. ( _None should be TBA_ )

If a course does not have any of the above, the field should be replaced by _NA_.

After extraction, the data needs to be output as JSON and XML. For JSON we will create a dictionary, which will later be converted with `json.dumps`; the XML will be built as a string.

## 2.  Import libraries 

In [388]:
import re
import json

## 3. Loading the data and other dependencies

Checking the data that has been provided.

In [389]:
#Loading the data in the format from the given file.
with open('29519136.txt','r') as f:
    file = f.read()
#checking how the data looks    
file[:1000]    

'<div class="content-inner__main">\n<!-- breadcrumbs -->\n<nav class="breadcrumbs mobile-hidden" id="breadcrumbs">\n<p class="visuallyhidden" id="breadcrumb__label">You are here:</p>\n<ul aria-labelledby="breadcrumb__label" class="breadcrumbs__list">\n<li class="breadcrumbs__item home">\n<a class="breadcrumbs__link" href="https://www.monash.edu/">Home</a>\n<span aria-hidden="true" class="breadcrumbs__divider">|</span>\n</li>\n<li class="breadcrumbs__item">\n<a class="breadcrumbs__link" href="https://www.monash.edu/study">Study</a>\n<span aria-hidden="true" class="breadcrumbs__divider">|</span>\n</li>\n<li class="breadcrumbs__item">\n<a class="breadcrumbs__link" href="/pubs/2019handbooks/">2019 Handbooks</a>\n<span aria-hidden="true" class="breadcrumbs__divider">|</span>\n</li>\n<li class="breadcrumbs__item breadcrumbs__current"><a class="breadcrumbs__link" href="/pubs/2019handbooks/units">Units</a></li>\n</ul>\n</nav>\n<div class="hbk-banner-box">\n<h1 class="banner_med"><span class="u

The output above gives a rough insight into how the data looks.

### 3.1 Initial Analysis after Loading the Data

Let's check how the data looks. From the preview above we can conclude that each course is nested inside div containers, each holding one unique course and starting with the marker __`<!-- breadcrumbs -->`__.
* The number of lines can be found by counting the `\n` characters.
* The number of complete tags (opening and closing) can be estimated by counting all closing tags with the pattern `</.*?>`.

In [390]:
lines = file.count('\n')
print('The number of lines ≈ ',lines)
tags = "</.*?>"
tag_lines = len(re.findall(tags,file,re.DOTALL|re.MULTILINE))
print('The number of tags ≈',tag_lines)
div_start = "<!-- breadcrumbs -->"
number_div = len(re.findall(div_start,file,re.DOTALL|re.MULTILINE))
print('The number of course containers = ',number_div)

The number of lines ≈  39088
The number of tags ≈ 33984
The number of course containers =  400


### 3.2 The HTML structure:

In section 3.1 we concluded that each individual course is stored inside a div container whose start is marked by __`<!-- breadcrumbs -->`__ and whose end is marked by __`<!-- /.content_container--> </div>`__.

__Why do we need to split the file into individual container chunks?__
The answer is intuitive once we look at the chunks: every regex we write is common to all the divs, so applying it to one chunk at a time gives us:
1. A null (empty) value for any regex that does not match in that chunk.
2. A unique value per course for regexes that match in every div.

The same idea is reused later when cleaning the extracted values.
Breaking the problem into chunks is also much cheaper at runtime than a single complicated regex (reference: https://blog.codinghorror.com/regex-performance/)
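
To make the chunk-then-match idea concrete, here is a minimal sketch on a made-up two-course string. The `sample` text, the unit codes in it and the variable names are assumptions for illustration only; they are not taken from the real file.

```python
import re

# Hypothetical miniature of the scraped file: two course containers.
sample = (
    '<!-- breadcrumbs --><span class="unitcode">ABC1234</span>'
    '<p>Prohibitions</p><a>XYZ5678</a><!-- /.content_container--> </div>'
    '<!-- breadcrumbs --><span class="unitcode">DEF4321</span>'
    '<!-- /.content_container--> </div>'
)

# Step 1: cut the text into one chunk per course.
container = re.compile(
    r'<!-- breadcrumbs -->(.*?)<!-- /\.content_container--> </div>', re.DOTALL)
# Step 2: simple patterns that are applied to each chunk separately.
unit_code = re.compile(r'(?<=<span class="unitcode">).*?(?=</span>)')
prohibition = re.compile(r'(?<=Prohibitions</p>).*', re.DOTALL)

records = []
for chunk in container.findall(sample):
    code = unit_code.search(chunk)
    prohib = prohibition.search(chunk)
    records.append({
        # A pattern that does not match in this chunk becomes 'NA'.
        'code': code.group(0) if code else 'NA',
        'prohibitions': re.findall(r'[A-Z]{3,4}\d{4}', prohib.group(0)) if prohib else 'NA',
    })

print(records)
```

Because each pattern only ever sees one course, a missing section cannot accidentally pick up the value of the next course.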

In [391]:
div_start_end = "<!-- breadcrumbs -->(.*?)<!-- /.content_container--> </div>"
var_explaination = re.findall(div_start_end,file,re.DOTALL)
print(var_explaination[2])


<nav class="breadcrumbs mobile-hidden" id="breadcrumbs">
<p class="visuallyhidden" id="breadcrumb__label">You are here:</p>
<ul aria-labelledby="breadcrumb__label" class="breadcrumbs__list">
<li class="breadcrumbs__item home">
<a class="breadcrumbs__link" href="https://www.monash.edu/">Home</a>
<span aria-hidden="true" class="breadcrumbs__divider">|</span>
</li>
<li class="breadcrumbs__item">
<a class="breadcrumbs__link" href="https://www.monash.edu/study">Study</a>
<span aria-hidden="true" class="breadcrumbs__divider">|</span>
</li>
<li class="breadcrumbs__item">
<a class="breadcrumbs__link" href="/pubs/2019handbooks/">2019 Handbooks</a>
<span aria-hidden="true" class="breadcrumbs__divider">|</span>
</li>
<li class="breadcrumbs__item breadcrumbs__current"><a class="breadcrumbs__link" href="/pubs/2019handbooks/units">Units</a></li>
</ul>
</nav>
<div class="hbk-banner-box">
<h1 class="banner_ada"><span class="unitcode">MDC5320</span> - Multimedia design studio 3<span class="hbk-archive

### 3.3 MAKING REGEX FOR TAGS:

From the blocks we can see that each item is enclosed between certain tags, with the following structure:
1. Unit code is between `<span class="unitcode">` and `</span>` \*\*
2. Synopsis is between `Synopsis</h2>` and `</div>`
3. Prerequisites are between `Prerequisites</p>` and `<p class="hbk-preamble-heading">Prohibitions`
4. Requirements are between `<h2 class="hbk-heading">Assessment</h2>` and `</div>`
5. Unit title is after `<span class="unitcode">...</span> -` and before the next `<span` \*\*
6. Chief examiner is between `Chief examiner(s)</p>` and `</p>`
7. Outcomes are between `Outcomes</h2>` and `</div>`
8. Prohibitions are between `Prohibitions</p>` and `</div>`

__NOTE:__ This only isolates the individual tag blocks; further pre-processing is then done on each block.

Items marked \*\* will need to be cleaned further.

In [392]:
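#Each pattern uses a lookbehind for the opening marker and a lookahead for the closing marker
Outcomes2 = r'(?<=Outcomes</h2>).*?(?=</div>)'
#NOTE: this pattern returns no match on this file (see the [] in the output of 3.3.1)
synopsis2 = r'(?<=Synopsis</h2>\n\n).*?(?=</div>)'
chief_examiner = r'(?<=Chief\sexaminer\(s\)</p>).*?(?=</p>)'
prereq2 = r'(?<=Prerequisites</p>).*?(?=<p class="hbk-preamble-heading">Prohibitions)'  
prohibitions = r'(?<=Prohibitions</p>).*?(?=</div>)'
#The closing lookahead needs the leading '<', otherwise a stray '<' is left at the end of the match
requirements2 = r'(?<=<h2 class="hbk-heading">Assessment</h2>).*?(?=</div>)' 
unit_title = r'<span class="unitcode">.*?</span>\s-(.*?)<span'
the_new_trial = r'(?<=<span class="unitcode">).*?(?=</span>)'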
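
The patterns above can be tried on a small hand-written snippet before running them on the whole file. The sketch below assumes a simplified page layout (the `snippet` string is invented), and it also shows why `re.DOTALL` is needed whenever the wanted block spans several lines.

```python
import re

# Hypothetical snippet shaped like one unit page; not taken from the real file.
snippet = (
    '<h1><span class="unitcode">ABC1234</span> - Sample unit title<span class="x"></span></h1>\n'
    '<h2 class="hbk-heading">Outcomes</h2>\n<div>\n<p>Students will learn regex.</p>\n</div>'
)

code_re = r'(?<=<span class="unitcode">).*?(?=</span>)'
title_re = r'<span class="unitcode">.*?</span>\s-(.*?)<span'
outcomes_re = r'(?<=Outcomes</h2>).*?(?=</div>)'

# Lookbehind/lookahead keep the markers out of the match itself.
code = re.search(code_re, snippet).group(0)
# The title uses a capturing group instead, so we read group(1).
title = re.search(title_re, snippet).group(1).strip()
# The outcomes block spans several lines, so '.' must also match '\n'.
outcomes = re.search(outcomes_re, snippet, re.DOTALL).group(0)
# Without re.DOTALL the same pattern finds nothing.
no_dotall = re.search(outcomes_re, snippet)

print(code, '|', title, '|', repr(outcomes), '|', no_dotall)
```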

### 3.3.1 Segregating each Tag
Let's have a look at how each tag block looks.

In [393]:
print(re.findall(Outcomes2,file,re.DOTALL)[1])
print(re.findall(synopsis2,file,re.DOTALL|re.MULTILINE))
print(re.findall(chief_examiner,file,re.DOTALL)[1])
print(re.findall(prereq2,file,re.DOTALL)[1])
print(re.findall(prohibitions,file,re.DOTALL)[1])
print(re.findall(requirements2,file,re.DOTALL)[1])
print(re.findall(unit_title,file,re.DOTALL)[1])
print(re.findall(the_new_trial,file,re.DOTALL)[1])


<div>
<p>Upon successful completion of this unit students will be able to:</p>
<ol princestart="0" start="1" type="1">
<li>Apply independent research, problem-solving methodologies and advanced technical skills to plan and manage complex multimedia design solutions from initial concept to final resolution;</li>
<li>Demonstrate an advanced level of proficiency in the design and production of a multimedia product;</li>
<li>Communicate ideas and concepts to critically reflect, evaluate and justify their own multimedia design project;</li>
<li>Demonstrate an extensive understanding of the multimedia design discipline and its professional practices, within the scope of a specified multimedia design project;</li>
<li>Proficiently present multimedia design concepts in a logical and informed manner that has relevancy to a specified target audience, and;</li>
<li>Understand and apply the rules of occupational health and safety appropriate to the discipline practice.</li>
</ol>

[]

<p>
<a href

### 3.3.2 Cleaning of the Segregated Tags
The output above shows that the content has been grouped under its tag, but the required values are still wrapped in markup. After segregation we therefore need to process the data further to extract the needed values, for which we create further regexes. All the `li` elements (rows) also need to be kept as lists.
The regexes are as follows:

In [394]:
#GET ALL THE VALUES BETWEEN '">' AND '<'
chief_clean = r'(?<=">).*?(?=<)'
#GET ALL UNIT CODES: 3 TO 4 WORD CHARACTERS FOLLOWED BY 4 DIGITS
pro_clean =r'\w{3,4}\d{4}'
#GET THE TEXT BETWEEN <p> AND </p>
req_clean = r'(?<=<p>).*?(?=</p>)'
#TO REMOVE ANY REMAINING TAGS
rem_tag =r'<.*?>'
#TO FIND LI ELEMENTS AND RETURN THEIR CONTENT
list_li =r'(?<=<li>).*?(?=</li>)'
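
As a quick check of the cleaning patterns, the sketch below applies them to a small invented Outcomes block. The `block` string is an assumption modelled on the output in 3.3.1, and the unit-code pattern uses `[A-Z]` instead of `\w`, a slightly stricter variant than the one in the cell above.

```python
import re

# Hypothetical Outcomes block, shaped like the output in section 3.3.1.
block = (
    '<div>\n<ol>\n'
    '<li>Apply <em>research</em> skills;</li>\n'
    '<li>Present design\nconcepts.</li>\n'
    '</ol>\n'
)

list_li = r'(?<=<li>).*?(?=</li>)'
rem_tag = r'<.*?>'

# Each <li> row becomes one list entry; DOTALL lets a row span two lines.
items = re.findall(list_li, block, re.DOTALL)
# Any markup left inside a row is stripped afterwards.
clean_items = [re.sub(rem_tag, '', item) for item in items]

# Unit codes inside link markup: 3 to 4 capital letters followed by 4 digits.
links = '<a href="x">BFC3140</a> or <a href="y">FST3711</a>'
codes = re.findall(r'[A-Z]{3,4}\d{4}', links)

print(clean_items)
print(codes)
```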

#### TYPICAL CASES
Let's check the values after cleaning. The code below shows the __typical cases__ (_corner cases_) that may occur across the tags. _I have identified 3._

In [395]:
#HERE WE HANDLE THE CASE WITH LI ELEMENTS, BUT WE ALSO NEED TO KEEP IN MIND BLOCKS WITHOUT LI ELEMENTS
print('VALUE WITH LISTS = '+re.search(list_li,re.findall(Outcomes2,file,re.DOTALL)[1],re.DOTALL).group(0))
print('\n\n\n')
#THIS IS THE CASE WHERE THE TAG HOLDS ONLY ONE VALUE; SYNOPSIS IS ANOTHER EXAMPLE OF THIS
print('SINGLE VALUE = '+re.search(the_new_trial,file,re.DOTALL).group(0))
print('\n\n\n')
#LIST OF UNIT CODES; PROHIBITIONS ARE HANDLED THE SAME WAY AS PREREQUISITES
print('LIST FORMAT = '+str(re.findall(the_new_trial,re.findall(prereq2,file,re.DOTALL)[1],re.DOTALL)))




VALUE WITH LISTS = Apply independent research, problem-solving methodologies and advanced technical skills to plan and manage complex multimedia design solutions from initial concept to final resolution;




SINGLE VALUE = MSM5200




LIST FORMAT = ['BFC3140', 'FST3711']


## 4 CLEANING OF THE FILE (COMBINING EVERYTHING):

Now that we have seen the individual pieces, let's process the entire file: we extract all the values for each course container and then look at the output for the values within those tags. The steps are as follows:

1. Get the container. (as seen in 3.2)
2. Sub-containerize each variable. (as seen in 3.3.1)
3. Clean (extract) the values of the data. (as seen in 3.3.2)
4. Check the values again for remaining dirty data.

__In the next section__

5. Change it to the required format.

Let's start:


In [396]:
#THIS FUNCTION ONLY ADDS A SHORT DELAY BEFORE OUTPUT, AS THE NOTEBOOK HAS A SET LIMIT ON HOW MUCH IT CAN PROCESS PER SECOND
def time_delay():
    for x in range(0,10000):
        pass
    

In [397]:
#COLLECT THE BOUNDARIES OF EVERY CONTAINER IN ONE LIST SO THAT WE CAN ITERATE OVER IT; THIS FORMS THE FIRST CONTAINER

#Closing marker of a container (see section 3.2)
div_end = r"<!-- /\.content_container--> </div>"
#Gives a list of the index just after each start marker and the index of each end marker
regex_div_start =[m.end(0) for m in re.finditer(div_start,file)] 
regex_div_end = [m.start(0) for m in re.finditer(div_end,file)]
start_end = list(zip(regex_div_start,regex_div_end))
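
The offset-based slicing can be checked on a toy string. In this sketch the `sample` text is invented, and pairing starts with ends through `zip` assumes that every start marker is followed by exactly one end marker, which should be verified on real data.

```python
import re

# Hypothetical text with two containers and some noise around them.
sample = (
    'xx<!-- breadcrumbs -->AAA<!-- /.content_container--> </div>'
    'yy<!-- breadcrumbs -->BBB<!-- /.content_container--> </div>'
)
div_start = r'<!-- breadcrumbs -->'
div_end = r'<!-- /\.content_container--> </div>'

# m.end() is the index just after a start marker, m.start() the index of an end marker.
starts = [m.end() for m in re.finditer(div_start, sample)]
ends = [m.start() for m in re.finditer(div_end, sample)]

# Assumption: markers strictly alternate, so the i-th start belongs to the i-th end.
chunks = [sample[s:e] for s, e in zip(starts, ends)]
print(chunks)
```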

In [398]:
# SEGREGATION OF ALL SUB CONTAINERS:
unit = re.compile(unit_title)
p = re.compile(the_new_trial,re.DOTALL)
s = re.compile(Outcomes2,re.DOTALL|re.MULTILINE)
q = re.compile(synopsis2,re.DOTALL|re.MULTILINE)
c = re.compile(chief_examiner,re.DOTALL|re.MULTILINE)
pre = re.compile(prereq2,re.DOTALL|re.MULTILINE)
pro = re.compile(prohibitions,re.DOTALL|re.MULTILINE)
req = re.compile(requirements2,re.DOTALL|re.MULTILINE)

#CLEANING OF THE CONTAINER:
ch_cl = re.compile(chief_clean,re.DOTALL)
req_cl = re.compile(req_clean,re.DOTALL|re.MULTILINE)
req_li = re.compile(list_li,re.DOTALL|re.MULTILINE)
pro_uni = re.compile(pro_clean,re.DOTALL|re.MULTILINE)

In [399]:
'''
This function takes an input dictionary with key=tag, value=list_of_values (or a string).
1. It removes unwanted tags that might still be present.
2. It finds <p> tags and splits on them to give a new list.
3. It unwraps a list that holds a single value.
'''
def check_empty_single(the_dict):
    for keys in the_dict.keys():
        if len(the_dict[keys]) == 0:
            pass
        elif the_dict[keys] == '\n':
            the_dict[keys] = 'NA'
        #Check for <p>; if present, split and return the text between the tags
        elif '<p>' in the_dict[keys]:
            the_dict[keys] = re.findall(req_clean,the_dict[keys],re.DOTALL|re.MULTILINE)
        #Check for a single value, or a value that still contains a tag
        elif len(the_dict[keys]) == 1 or '<' in the_dict[keys]:
            #flags must be passed by keyword: the 4th positional argument of re.sub is count
            the_dict[keys] = re.sub(rem_tag,'',the_dict[keys][0],flags=re.DOTALL|re.MULTILINE)
        #If it is a list, clean every value in it
        elif isinstance(the_dict[keys], list):
            inner_list = []
            for value in the_dict[keys]:
                inner_list.append(re.sub(rem_tag,'',value,flags=re.DOTALL|re.MULTILINE))
            the_dict[keys] = inner_list
    return the_dict
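
One detail of `re.sub` is easy to get wrong when stripping tags: its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so flags belong in the `flags=` keyword. The sketch below uses an invented `text` value to show what a flag would mean if it were read as `count`, and the effect of `re.DOTALL` on tags that contain a newline.

```python
import re

# Hypothetical markup in which two tags are broken across lines.
text = '<b\n>bold</b\n> and <i>italic</i>'

# A flag is just an integer, so in fourth position it would be read as count.
as_count = int(re.DOTALL | re.MULTILINE)

# Passing flags by keyword applies them as intended.
cleaned = re.sub(r'<.*?>', '', text, flags=re.DOTALL)

# Without DOTALL, '.' stops at '\n' and the multi-line tags survive.
partly_cleaned = re.sub(r'<.*?>', '', text)

print(as_count)
print(repr(cleaned))
print(repr(partly_cleaned))
```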


In [400]:
'''
This chunk of code does the major segregation: it gathers all the data for each course container
into one dictionary, which is later used to create the JSON and XML objects.
The clean dictionaries, one per unit code, are stored in a list.
'''
count = 0
outer_list = []
#THIS WILL BE HEADER OF THE FILE OF THE XML
string2 = '<?xml version="1.0" encoding="UTF-8"?>'

#GET THE INDEXES FOR THE CONTAINER
for number in start_end:
    count += 1

    the_dict = {}
    
    #>>>>  SECTION SELECTION   
    #GETS ALL THE STRING FROM INDEX START TO END
    outcomes = s.findall(file[number[0]:number[1]])
    chief = c.findall(file[number[0]:number[1]])
    syn = q.findall(file[number[0]:number[1]])
    req_2 = req.findall(file[number[0]:number[1]])
    pro_2 = pro.findall(file[number[0]:number[1]])
    pre_2 = pre.findall(file[number[0]:number[1]])
    
    
    #FINDS THE VALUES ACCORDING TO THE REGEX ABOVE
    the_dict['@id'] = p.findall(file[number[0]:number[1]])
    
    try:
        the_dict['title'] = unit.search(file[number[0]:number[1]]).group(1)
    #search() returns None when there is no title, so .group raises AttributeError
    except AttributeError:
        the_dict['title'] = 'NA'
    

    

    #>>>>>> CLEANING FINAL VALUE
    #syn is a list, so test its first element rather than comparing the list to a string
    if len(syn) != 0 and syn[0].strip() != '':
        the_dict['Synopsis']=syn
    else:
        the_dict['Synopsis'] = 'NA'

    if len(outcomes) != 0 and '<li>' in outcomes[0] :
        the_dict['Outcomes']=req_li.findall(outcomes[0])
    elif len(outcomes) != 0:
        the_dict['Outcomes']=outcomes
    else:
        the_dict['Outcomes'] = 'NA'    
    
    if len(req_2) != 0 and '<li>' in req_2[0] :
        the_dict['Requistics']=req_li.findall(req_2[0]) #NOTE: need to add the code for requistic[:-1]
    elif len(req_2) != 0:
        the_dict['Requistics']=req_cl.findall(req_2[0])
    else:
        the_dict['Requistics'] = 'NA'    

    if len(chief) != 0:
        the_dict['Cheif_examiners']=ch_cl.findall(chief[0])
    else:
        the_dict['Cheif_examiners'] = 'NA'    

    if len(pro_2) != 0:
        the_dict['Prohibisions']=list(set(pro_uni.findall(pro_2[0])))
    else:
        the_dict['Prohibisions'] = 'NA'  

    if len(pre_2) != 0:
        the_dict['prerequistes']=list(set(pro_uni.findall(pre_2[0])))
    else:
        the_dict['prerequistes'] = 'NA'        
    the_dict = check_empty_single(the_dict)
    #STORE ALL THE VALUES IN THE DICT
    outer_list.append(the_dict)
    
    


## 5 Writing the Files:
The files are written as XML and JSON. The XML is built as a string: a loop takes the values from each dictionary and appends them to the XML string.
### 5.1 Writing into XML

In [401]:
#Open the root element that is closed by '</units>' at the end
string2 += '<units>\n'
for the_dict in outer_list:
    if the_dict['@id']: 
        string2 += '<unit id="'+ the_dict["@id"] + '">'
        for keys,values in the_dict.items():
            if not isinstance(values, list) and keys != '@id':
                string2 += "<"+str(keys)+">" +str(values)+ "</"+str(keys)+">\n"
            elif keys != '@id':
                string2 += "<"+str(keys)+">\n"
                for value in values:
                    string2 += "<"+str(keys[:-1])+">" +str(value)+ "</"+str(keys[:-1])+">\n"
                string2 += "</"+str(keys)+">\n"
        string2 += '</unit>\n'                     
string2+='</units>'

In [402]:
#THIS WRITES INTO THE FILE (UTF-8, MATCHING THE ENCODING DECLARED IN THE XML HEADER)
with open('29519136.xml','w',encoding='utf-8') as f:
    f.write(string2)
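
Building XML by string concatenation does not escape characters such as `&` or `<` inside values. As an alternative sketch, not the method used in this notebook, the standard-library `xml.etree.ElementTree` module can build the same `units/unit` layout and handle escaping itself. The `units` list below is invented sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical cleaned records in the same shape as outer_list.
units = [
    {'@id': 'ABC1234',
     'title': 'Research & design',
     'Outcomes': ['Apply skills', 'Present <concepts>']},
]

root = ET.Element('units')
for unit in units:
    # '@id' becomes the id attribute of <unit>.
    node = ET.SubElement(root, 'unit', id=unit['@id'])
    for key, value in unit.items():
        if key == '@id':
            continue
        child = ET.SubElement(node, key)
        if isinstance(value, list):
            # List values get one singular child each, e.g. Outcomes -> Outcome.
            for item in value:
                ET.SubElement(child, key[:-1]).text = item
        else:
            child.text = value

# Special characters in the text are escaped during serialization.
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```

Parsing `xml_text` back with `ET.fromstring` is a cheap way to confirm that the output is well-formed before writing it to disk.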

### 5.2 Writing into JSON
The code loops over the cleaned dictionaries and builds the structure that is then written in JSON format.

In [403]:
json_list = []
dict_without_subdict= ['@id','title','Synopsis']
for the_dict in outer_list:
    #print(the_dict)
    for keys,values in the_dict.items():
        #THIS PIECE OF CODE HAS BEEN ADDED TO PUT IN A DELAY, AS JUPYTER NOTEBOOK HAS A LIMIT ON ITS IO RATE
        time_delay()
        
        if keys == '@id':
            sub_dic = {}
            sub_dic[keys] = the_dict[keys]
        elif values == 'NA':
            sub_dic[keys] = the_dict[keys]
            #json_list.append(sub_dic)
        elif keys in dict_without_subdict:
            sub_dic[keys] = the_dict[keys]
            #json_list.append(sub_dic) 
        else:
            sub_dic[keys] = {}
            sub_dic[keys][keys[:-1]] = values
    json_list.append(sub_dic)

print(json_list)           

[{'@id': 'MSM5200', 'title': ' Advanced studies in biomedical sciences MUM', 'Synopsis': 'NA', 'Outcomes': 'NA', 'Requistics': 'NA', 'Cheif_examiners': {'Cheif_examiner': 'Associate Professor Md. Ezharul Hoque Chowdhury'}, 'Prohibisions': 'NA', 'prerequistes': 'NA'}, {'@id': 'ACF5953', 'title': ' Financial accounting', 'Synopsis': 'NA', 'Outcomes': {'Outcome': ['describe and compare the regulatory requirements, domestic and international, associated with the preparation of general purpose financial statement for companies', "apply and critique the accounting rules for entities' investments in other entities, and apply these rules to prepare consolidated financial statements", 'analyse a number of measurement and financial reporting issues and their possible resolution, including: accounting for income tax, post-acquisition accounting for assets, and business combinations', 'develop capabilities to work effectively in a group to produce professional quality research reports; effective i

In [404]:
#Adds the boilerplate that is required to generate a JSON of format {units:{unit:[the_cleaned_list]}}
boiler_plate = {}
boiler_plate['units'] = {}
boiler_plate['units']['unit'] = json_list

Up to this point the data is still a Python dict with the same shape as the target JSON. A couple of things still need to change, such as the encoding, but we do not have to worry about them: `json.dumps` converts the dictionary directly into JSON format.
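
The sketch below illustrates the dict-to-JSON step on a single invented record (`json_list` here is sample data, not the real output). It also shows how `json.dumps` treats non-ASCII characters: by default they are written as `\uXXXX` escapes, while `ensure_ascii=False` keeps them as-is.

```python
import json

# Hypothetical record in the same shape as the entries of json_list.
json_list = [
    {'@id': 'ABC1234',
     'title': 'Caf\u00e9 design',
     'Outcomes': {'Outcome': ['first outcome', 'second outcome']},
     'Prohibisions': 'NA'},
]
boiler_plate = {'units': {'unit': json_list}}

# Default behaviour: non-ASCII characters are escaped.
escaped = json.dumps(boiler_plate, indent=2)
# ensure_ascii=False keeps the characters readable in the file.
readable = json.dumps(boiler_plate, indent=2, ensure_ascii=False)

# Both strings decode back to the same dictionary.
restored = json.loads(readable)
print(readable)
```

When `ensure_ascii=False` is used, opening the output file with `encoding='utf-8'` keeps the written bytes consistent across platforms.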

In [405]:
#This is where the dumps will take place
json_obj =json.dumps(boiler_plate,indent=2)
time_delay()
print(json_obj)
with open('29519136.json','w') as j:
    j.write(json_obj)
    
    

{
  "units": {
    "unit": [
      {
        "@id": "MSM5200",
        "title": " Advanced studies in biomedical sciences MUM",
        "Synopsis": "NA",
        "Outcomes": "NA",
        "Requistics": "NA",
        "Cheif_examiners": {
          "Cheif_examiner": "Associate Professor Md. Ezharul Hoque Chowdhury"
        },
        "Prohibisions": "NA",
        "prerequistes": "NA"
      },
      {
        "@id": "ACF5953",
        "title": " Financial accounting",
        "Synopsis": "NA",
        "Outcomes": {
          "Outcome": [
            "describe and compare the regulatory requirements, domestic and international, associated with the preparation of general purpose financial statement for companies",
            "apply and critique the accounting rules for entities' investments in other entities, and apply these rules to prepare consolidated financial statements",
            "analyse a number of measurement and financial reporting issues and their possible resolution, includi

## 6 Conclusion:

This project gave us insight into extracting data from a semi-structured source and how to munge/wrangle it. Working through a real-world example helped us understand wrangling better. We also learned that semi-structured data does not always follow the same pattern, and that newer, more adaptive techniques should be used to tackle such issues. Finally, we worked with real-world data formats, JSON and XML, and learned how to write them using Python.