# FIT5196 Assessment 1 Task 3 Convert XML to JSON

#### Student Name: Vipul Krishnan M.D.
#### Student ID: 28104641

Date: 02/04/2018

Version: 1.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* re (version 2.2.1)
* json (version 2.0.9)

## 1. Introduction

This task converts the Australian Sport Thesaurus stored in an XML file ("australian-sport-thesaurus-student.xml") into a JSON file, subject to the following requirements:

* The data must be extracted correctly.
* Existing Python packages written to parse XML files (e.g., Beautiful Soup, lxml and ElementTree) must not be used to extract the thesaurus.
* Python packages such as json may be used to save the extracted thesaurus.
* The JSON data must be saved in a file named "sport.dat".
* The input file must be "australian-sport-thesaurus-student.xml".

## 2. Importing Libraries

In [None]:
import re
import json

## 3. Reading Data from File

The XML data is in the australian-sport-thesaurus-student.xml file. We can use open() and the file object's read() method to read it.

In [None]:
with open('australian-sport-thesaurus-student.xml', 'r', encoding="utf8") as myfile:
    data = myfile.read() # read the whole file and store it as a string in the 'data' variable

## 4. Converting XML to JSON

An XML document consists of elements. Each element has a start tag and an end tag, and can contain a value or other (child) elements. Every XML document has a root element that encloses all the other elements (w3schools.com, 2018).

JSON, on the other hand, consists of key-value pairs. A value can be a simple string or another JSON object, and JSON also supports arrays (w3schools.com, 2018).

Given a simple XML element such as <a>x</a>, our task is to convert it into a key-value pair such as "a":"x". The tag name becomes the key, and the text between the tags becomes the value.
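
As a minimal sketch of this mapping, the single element above can be turned into a key-value pair with one regular expression. The sample string and the variable names are illustrative only and are not part of the thesaurus file.

```python
import re
import json

sample = "<a>x</a>"  # hypothetical one-element XML snippet

# capture the tag name and the text between the opening and closing tags;
# \1 requires the closing tag to repeat the opening tag's name
match = re.fullmatch(r"<([^<>/]+)>([^<>]*)</\1>", sample)
pair = {match.group(1): match.group(2)}

print(json.dumps(pair))  # {"a": "x"}
```

This only handles one flat element; nested elements need the tag list and the recursion developed in the rest of this section.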

For this task we first extract all tag names and values from the XML, using the regular expressions below.


In [None]:
tags = re.findall(r'<[^<>]+>',data) # finds every '<...>' substring, i.e. all the tags, including closing tags

for i in range(0,len(tags)):
    tags[i] = tags[i].strip('<').strip('>') # removes the enclosing '<' and '>' from each tag

tags = tags[1:] # removing the first tag, which is just the prolog
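
To make the result of this step concrete, here is a small sketch that applies a pattern of the same form to a made-up miniature document. The sample string and the element names in it are assumptions for illustration, not content from the thesaurus file.

```python
import re

# hypothetical miniature document, not taken from the thesaurus file
sample = '<?xml version="1.0"?><Terms><Term><Title>Archery</Title></Term></Terms>'

# find every tag and drop the enclosing angle brackets
sample_tags = [t.strip('<>') for t in re.findall(r'<[^<>]+>', sample)]
sample_tags = sample_tags[1:]  # drop the prolog

print(sample_tags)
# ['Terms', 'Term', 'Title', '/Title', '/Term', '/Terms']
```

Closing tags keep their leading '/', which is how the conversion function later tells opening and closing tags apart.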

In [None]:
values = re.findall(r'<[^<>/]+>[^<>]+</',data) # find every opening tag followed by text and the start of a closing tag

for i in range(0,len(values)):
    values[i] = re.findall(r'>[^<>]+</',values[i])[0] # from each match, keep the part from '>' up to '</'
    # strip off '<', '>' and '/' to leave only the value, then escape double quotes and newlines
    # (the escaping is needed by the manual serializer further below; json.dumps() escapes these characters itself)
    values[i] = values[i].strip('/').strip('>').strip('<').replace('"','\\"').replace('\n', '\\n')
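
The value extraction can be tried on a small made-up snippet. The sketch below uses a capturing group to pull out the text in one pass, which is a variation on the two-step approach above rather than the same code. The sample string is an assumption for illustration.

```python
import re

# hypothetical miniature document, not taken from the thesaurus file
sample = '<Term><Title>Archery</Title><Description>Bow and "arrow" sport</Description></Term>'

# an opening tag, then text, then the start of a closing tag; the group keeps only the text
sample_values = re.findall(r'<[^<>/]+>([^<>]+)</', sample)

print(sample_values)
# ['Archery', 'Bow and "arrow" sport']
```

The parent element <Term> contributes no value because it is followed directly by another tag, so only simple elements produce entries in the list.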
    

The JSON structure is very similar to the structure of a Python dictionary, so it is easy to store the data in a dictionary first and then write it out as a JSON file.

We need a function for this. The function and its explanation are below.

In [None]:
j = 0 # we define a global variable which is common for all recursive calls

def xmltodictionary(taglist): # Argument is the list of tags 
    
    stack = [] # we initialize a stack
    global j # declare the variable 'j' as global to access it (its already initialized before function)
    tname = taglist[0] # initialize tname as the first tag name. 
    
    if len(taglist) == 2: # if it is not a parent element (i.e. only an opening and a closing tag around a single string value)
        d = {} # initialize a dictionary
        d[taglist[0]] = values[j] # put tname as key and the next extracted value as value
        j = j+1 # increase the global variable j (which corresponds to the position of the next value to be added)
        return d # return the dictionary
    
    else: # otherwise it is a parent element
        d = {} # declare an empty dictionary
        gd = {} # this dictionary is used as the value of the complex element
        init = 1 # init represents the starting position in the list of the current child element
        stack.append(taglist[1]) # push the opening tag of the first child element onto the stack
        i = 2 # initialize i as 2
        
        # the while loop below separates each child element inside the root tag of the given xml
        # it then recursively calls the function to process each child element
        
        while (i < len(taglist)-1): # for each tag from position 2 to the second last
            
            if taglist[i][0] == '/': # if it is a closing tag
                stack.pop() # then pop the corresponding opening tag from the stack
                
                if len(stack) == 0: # if stack becomes empty, it means that a child element ends at this position
                    nd = xmltodictionary(taglist[init:i+1]) # we call the function recursively for that child element
                    
                    # the loop below checks whether the tag name is repeated; in that case we use a list, as JSON supports lists
                    for k in nd.keys(): # for each returned key
                        
                        if k not in gd.keys(): # check whether this key has been returned previously
                            gd[k] = nd[k] # if not, it is a unique key so far, so it is stored on its own
                        
                        else: # if the key is already there...
                            
                            if isinstance(gd[k], list): # and the value is already a list..
                                gd[k].append(nd[k]) # add the new value to the list
                            
                            else: # if it is currently a single value..
                                gd[k] = [gd[k], nd[k]] # make it a list by adding the new value too
                
                i = i+1 # increase the i value to get the next tag
            
            else: # if it is not the end of a child element...
                stack.append(taglist[i]) # append the new tagname to stack
                
                if len(stack) == 1: # if it is the first tag of a child element..
                    init = i # then note this position as the initial tagname
                
                i = i+1 # increase the i value to get the next tag
            
        d[tname] = gd # add key as tname and value as gd, which is dictionary constructed after processing child elements
        return d # return the main dictionary

The function above is heavily commented to make it easier to follow.

In short, the function is recursive. Each call takes the tag list of one XML element and returns a dictionary with a single key: the tag name of that element's root. If the element is simple (i.e. it contains no child elements), the value is a string. Otherwise the value is a dictionary obtained by calling the function recursively on each child element and merging the returned dictionaries. During merging, values that share the same key are collected into a list.
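
The merging rule can be shown in isolation. The sketch below is a simplified illustration, not the function used in this notebook: the helper name merge_child and the sample dictionaries are hypothetical.

```python
def merge_child(parent, child):
    """Merge a child dictionary into parent, collecting repeated keys in a list."""
    for key, value in child.items():
        if key not in parent:
            parent[key] = value                 # first occurrence: store as-is
        elif isinstance(parent[key], list):
            parent[key].append(value)           # third or later occurrence: extend the list
        else:
            parent[key] = [parent[key], value]  # second occurrence: start a list

merged = {}
children = [{"Term": {"Title": "Archery"}}, {"Term": {"Title": "Badminton"}}, {"Notes": "n/a"}]
for child in children:
    merge_child(merged, child)

print(merged)
# {'Term': [{'Title': 'Archery'}, {'Title': 'Badminton'}], 'Notes': 'n/a'}
```

A key that occurs once keeps a plain value, while a repeated key ends up as a list, which is why a later step has to normalise elements that should always be arrays.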

Now we can call this function with our entire tag list.

In [None]:
j = 0 # reset j so that values are taken from the beginning of the list
dictionary = xmltodictionary(tags) # calling the function and saving the data as a dictionary

Now we have to convert this to JSON. As we are allowed to use the Python json package, the easiest way is to use the json.dumps() method as below (Python Software Foundation, 2018). However, in case this method is not permitted, I have also written two functions that do the conversion manually. We will look at the json.dumps() method first.

In [None]:
json.dumps(dictionary)

This gives the standard JSON structure corresponding to our XML. However, the JSON in the given image differs in several ways:

* The root key name is different.
* There is no key named "Term". For example, instead of "Terms":{"Term":[{... it is shortened to "Terms":[{....
* Regardless of the number of related terms, they are always presented as an array, i.e. even a single related term is presented as an array.
* The order of the terms is different (however, it was made clear in the forum that the order does not matter).

So we have to make these changes. For that we can define the following function.

In [None]:
# This function changes structure of the dictionary to obtain JSON as in the image
def correct_dict(dictionary):
    
    new_dictionary = {} # a new dictionary to return
    
    if not isinstance(dictionary, dict):
        return dictionary # if the parameter is not a dictionary,just return it.
    
    for key in dictionary.keys(): # for each key of the dictionary
        
        if key == "Terms": # if the key is "Terms"
            new_list = []
            
            for element in dictionary[key]["Term"]: # bypassing the key "Term"
                new_list.append(correct_dict(element)) # build a new list, which becomes the value for the key "thesaurus", while recursively calling the function for child elements
                
            new_dictionary["thesaurus"] = new_list # changing the key name to "thesaurus" and adding the new list as value
            
        elif key == "RelatedTerms": # if the key is "RelatedTerms"
            
            if not isinstance(dictionary[key]["Term"], list): # if the terms are not already a list
                new_dictionary[key] = [correct_dict(dictionary[key]["Term"])] # make it a list, also bypass the key "Term", while recursively calling the function for child elements
            
            else:
                new_dictionary[key] = correct_dict(dictionary[key]["Term"]) # else just bypass the key "Term", while recursively calling the function for child elements
    
        else:
            new_dictionary[key] = correct_dict(dictionary[key]) # else just add as in the original dictionary, while recursively calling the function for child elements
            
    return new_dictionary
        

The above function recursively goes through the parameter dictionary and all its child dictionaries. 

If the key is "Terms", it renames it to "thesaurus" and also bypasses the succeeding key, which is "Term".

If the key is "RelatedTerms", it bypasses the succeeding key, which is "Term", and also makes sure that the value is a list.

Otherwise it just adds the child elements as in the original dictionary.
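
The "always a list" rule for related terms can be illustrated on its own. In the sketch below the helper name as_list and the sample dictionaries are hypothetical and only stand in for the real thesaurus data.

```python
def as_list(value):
    """Return value unchanged if it is already a list, otherwise wrap it in one."""
    return value if isinstance(value, list) else [value]

# hypothetical fragments: one related term versus several
one = {"RelatedTerms": {"Term": {"Title": "Bowhunting"}}}
many = {"RelatedTerms": {"Term": [{"Title": "Bows"}, {"Title": "Arrows"}]}}

print(as_list(one["RelatedTerms"]["Term"]))
# [{'Title': 'Bowhunting'}]
print(as_list(many["RelatedTerms"]["Term"]))
# [{'Title': 'Bows'}, {'Title': 'Arrows'}]
```

In both cases the intermediate "Term" key is dropped and the result is a list, so downstream code can iterate over related terms without checking the type first.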

Now we can call this function and build the JSON string.

In [None]:
newdic = correct_dict(dictionary) # correcting the dictionary to match with the style in the image

In [None]:
json.dumps(newdic) # trying with json.dumps

As mentioned above, if using the json.dumps() method is not permitted, we can produce the JSON string with the manually written functions below.

In [None]:
# this function converts a dictionary to a JSON-formatted string
# (string values are inserted as-is, so they must already be escaped)
def dicttostr(d):
    parts = [] # one '"key":value' fragment per entry
    for k in d.keys():
        if isinstance(d[k], str): # if the value is a string
            parts.append("\"" + k + "\":\"" + d[k] + "\"")
        elif isinstance(d[k], list): # if the value is a list
            parts.append("\"" + k + "\":" + listtostr(d[k]))
        else: # otherwise the value is a dictionary
            parts.append("\"" + k + "\":" + dicttostr(d[k]))
    return "{" + ",".join(parts) + "}" # join the fragments with commas and wrap them in braces

# this function converts a list to its string representation in JSON format
def listtostr(l):
    parts = [] # one fragment per element
    for element in l:
        if isinstance(element, str): # if the element is a string
            parts.append("\"" + element + "\"")
        elif isinstance(element, list): # if the element is a list
            parts.append(listtostr(element))
        else: # if the element is a dictionary
            parts.append(dicttostr(element))
    return "[" + ",".join(parts) + "]" # join the fragments with commas and wrap them in brackets
   

In [None]:
dicttostr(newdic) # printing using the manual methods

However, as the json.dumps() function is more efficient and faster, we use it for the final output.
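
One practical point about json.dumps() is that it escapes double quotes and newlines itself. The following sketch shows this on a made-up record (the keys and values are assumptions for illustration) and checks that the string parses back to the same dictionary.

```python
import json

# hypothetical record containing characters that need escaping in JSON
record = {"Title": 'The "Ashes"', "Notes": "line one\nline two"}

text = json.dumps(record)
print(text)
# {"Title": "The \"Ashes\"", "Notes": "line one\nline two"}

# parsing the string back returns the original dictionary
assert json.loads(text) == record
```

A round trip through json.loads() like this is a cheap way to confirm that a generated string is valid JSON.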

Now we need to write it to a file. We can use the write() method for this.

In [None]:
with open('sport.dat','w', encoding="utf-8") as file: # creating / opening the file for writing; it is closed automatically afterwards
    file.write(json.dumps(newdic)) # writing the data to file
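
As an optional sanity check, the written file can be read back and compared with the dictionary. The sketch below is self-contained: it writes a small stand-in dictionary (a hypothetical replacement for newdic) to a temporary directory, so the real sport.dat is not touched.

```python
import json
import os
import tempfile

data_out = {"thesaurus": [{"Title": "Archery"}]}  # hypothetical stand-in for newdic

# write to a temporary directory so the real sport.dat is not overwritten
path = os.path.join(tempfile.mkdtemp(), "sport.dat")
with open(path, "w", encoding="utf-8") as f:
    json.dump(data_out, f)

# read the file back and parse it
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == data_out)  # True
```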

## 5. Summary

We have successfully converted the XML to JSON using the steps below:

* The XML data is read and the tags and values are separated using Python's re module.
* Using a recursive function, a dictionary corresponding to the XML data is created from the tag and value lists.
* Using a second recursive function, the structure of the dictionary is changed to match the JSON format in the given image.
* The dictionary is converted to JSON, using either the json.dumps() method or a manually written function.
* The JSON data is written to the sport.dat file.

## 6. References

- W3Schools.com. (2018). *XML Tutorial*. Retrieved from https://www.w3schools.com/xml/default.asp
- W3Schools.com. (2018). *JSON Tutorial*. Retrieved from https://www.w3schools.com/js/js_json_intro.asp
- Python Software Foundation. (2018). *JSON encoder and decoder (Documentation)*. Retrieved from https://docs.python.org/2/library/json.html


