## COMP20008 2020 Semester 2 Workshop 2 ##

### Using lxml to read XML data



We will use the *lxml* Python package. *lxml* provides several APIs (Application Programming Interfaces) for working with XML data. The first is the ElementTree API, which lets us access XML data as a tree-like structure. The full API reference is available [here](https://lxml.de/api/index.html), but it is probably less useful than the pandas API reference you encountered last week. The official lxml site does, however, have a [tutorial](https://lxml.de/tutorial.html) which is quite thorough and makes a great reference.

As with any other Python package, you need to issue an import statement to load it:

In [1]:
from lxml import etree


For this section we will work with the royal.xml file, which contains the names of some members of the British royal family. The code below simply displays the contents of that file; you can also open the file in a web browser or text editor. Look through the file and ensure you understand its content.

In [2]:
f = open("royal.xml", "r")
text = f.read()
print(text)
f.close()

<?xml version="1.0" encoding="utf-8"?>
  <queen title="Queen Elizabeth II" marriedTo="Philip, Duke of Edinburgh">
      <prince title="Charles, Prince of Wales" marriedTo="Lady Diana Spencer">
		<prince title="Prince William of Wales" />
		<prince title="Prince Henry of Wales" />
      </prince>
      <princess title="Anne, Princess Royal" />
      <prince title="Andrew, Duke of York" />
      <prince title="Edward, Earl of Wessex" />
</queen>



In order to load an XML file and to represent it as a tree in computer memory, you need to parse the XML file. The etree.parse() function parses the XML file that is passed in as a parameter.  

In [3]:
xmltree = etree.parse("royal.xml")

The *parse()* function returns an *ElementTree* object, which represents the whole XML tree. Each node in the tree is represented by an *Element* object.

Use the *getroot()* method of an *ElementTree* object to get the root element of the XML tree. You can print the XML tag of an element using its *tag* property.

In [4]:
type(xmltree)

lxml.etree._ElementTree

In [5]:
#To check the etree object is really XML
print(etree.tostring(xmltree, pretty_print=True, encoding="UTF-8"))

b'<queen title="Queen Elizabeth II" marriedTo="Philip, Duke of Edinburgh">\n      <prince title="Charles, Prince of Wales" marriedTo="Lady Diana Spencer">\n\t\t<prince title="Prince William of Wales"/>\n\t\t<prince title="Prince Henry of Wales"/>\n      </prince>\n      <princess title="Anne, Princess Royal"/>\n      <prince title="Andrew, Duke of York"/>\n      <prince title="Edward, Earl of Wessex"/>\n</queen>\n'


In [6]:
root = xmltree.getroot()
print(root.tag)
# indexing an element reaches only its direct children; queen is the root, and its children all sit one level below it

queen


In [7]:
print(etree.tostring(root, pretty_print=True, encoding="UTF-8"))

b'<queen title="Queen Elizabeth II" marriedTo="Philip, Duke of Edinburgh">\n      <prince title="Charles, Prince of Wales" marriedTo="Lady Diana Spencer">\n\t\t<prince title="Prince William of Wales"/>\n\t\t<prince title="Prince Henry of Wales"/>\n      </prince>\n      <princess title="Anne, Princess Royal"/>\n      <prince title="Andrew, Duke of York"/>\n      <prince title="Edward, Earl of Wessex"/>\n</queen>\n'


In [8]:
type(root)

lxml.etree._Element

### Traversing the XML Tree

The following sections describe various methods for traversing the XML tree.

To visit all of the children of an element, you can iterate over the XML *Element* itself:

In [9]:
# iterate over every child of Queen Elizabeth
for e in root:
    print(e.tag)

prince
princess
prince
prince


You can use indexing to access the children of an element:


In [10]:
oldest_prince = root[0]  # first child of the root
#print(type(oldest_prince))
print(oldest_prince.get("title"))  # every child of the root is itself an element (a subtree), so the structure is recursive

Charles, Prince of Wales


In [11]:
len(root)

4

In [12]:
print(oldest_prince.get("marriedTo"))

Lady Diana Spencer


The *find()* method returns only the first matching child (or `None` if there is no match).



In [13]:
the_first_child_with_prince_tag = root.find("prince")
print (the_first_child_with_prince_tag.get('title'))

Charles, Prince of Wales


The *iterchildren()* method allows you to iterate over children with a particular tag:



In [14]:
for child in root.iterchildren(tag="prince"):
    print (child.get('title'))

Charles, Prince of Wales
Andrew, Duke of York
Edward, Earl of Wessex


There is also an *iterdescendants()* method to iterate over all descendants of a particular node.
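
A minimal sketch of *iterdescendants()* in use. It assumes the same structure as *royal.xml*, but parses an inline copy (the names `ROYAL_XML` and `royal_root` are ours) so that it runs without the file and does not disturb the `root` variable used above.

```python
from lxml import etree

# Inline copy of the royal.xml structure (an assumption, so the sketch runs on its own)
ROYAL_XML = b"""<queen title="Queen Elizabeth II" marriedTo="Philip, Duke of Edinburgh">
  <prince title="Charles, Prince of Wales" marriedTo="Lady Diana Spencer">
    <prince title="Prince William of Wales" />
    <prince title="Prince Henry of Wales" />
  </prince>
  <princess title="Anne, Princess Royal" />
  <prince title="Andrew, Duke of York" />
  <prince title="Edward, Earl of Wessex" />
</queen>"""

royal_root = etree.fromstring(ROYAL_XML)

# iterdescendants() walks children, grandchildren and so on, in document order
descendant_titles = [e.get("title") for e in royal_root.iterdescendants()]

# Passing a tag restricts the walk to matching elements at any depth
prince_titles = [e.get("title") for e in royal_root.iterdescendants(tag="prince")]

print(len(descendant_titles))
print(prince_titles)
```

Unlike *iterchildren()*, the walk does not stop at the first level, so the two grandsons appear in `prince_titles` alongside the queen's sons.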

### Exercise 1

Using the *royal.xml*:

i) Write Python code to get the title attribute of the queen's grandsons.

ii) Write Python code to get the full title of the only princess in the family tree.

In [15]:
#insert answer to 1 here
prince = root[0]
for child in prince:
    print (child.get('title'))

Prince William of Wales
Prince Henry of Wales


In [16]:
# for each child of the root, iterate over its children that have the tag 'prince'
for child in root:
    for grandchild in child.iterchildren(tag="prince"):
        print(grandchild.get('title'))

Prince William of Wales
Prince Henry of Wales


In [17]:
#second question: title of the only princess
the_first_princess = root.find('princess')
the_first_princess.get('title')

'Anne, Princess Royal'

### Accessing XML attributes


You can access the XML attributes of an element using the *get()* method or the *attrib* property of the element.



In [18]:
print (root.attrib)

{'title': 'Queen Elizabeth II', 'marriedTo': 'Philip, Duke of Edinburgh'}


In [19]:
for child in root:
    print (child.tag)
    print (child.attrib)

prince
{'title': 'Charles, Prince of Wales', 'marriedTo': 'Lady Diana Spencer'}
princess
{'title': 'Anne, Princess Royal'}
prince
{'title': 'Andrew, Duke of York'}
prince
{'title': 'Edward, Earl of Wessex'}


In [20]:
print (root.get("title"))

Queen Elizabeth II
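
Not every element carries every attribute; in *royal.xml* only some elements have `marriedTo`. The sketch below, which parses a single inline element rather than the file (the variable names are ours), shows how *get()* and *attrib* behave when an attribute is missing.

```python
from lxml import etree

princess = etree.fromstring('<princess title="Anne, Princess Royal" />')

# get() returns None for a missing attribute, or the supplied default
spouse = princess.get("marriedTo")
spouse_or_default = princess.get("marriedTo", "unknown")

# attrib behaves like a dictionary, so membership tests and keys() work
has_title = "title" in princess.attrib
attribute_names = list(princess.attrib.keys())

print(spouse, spouse_or_default, has_title, attribute_names)
```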


### Accessing XML text


The *book.xml* file differs from *royal.xml* in that it has some
text content within each element. To access the text content of an
element (the text between its start and end tags), use the *text* property
of that element:

In [21]:
from lxml import etree
xmltree = etree.parse('book.xml')
root = xmltree.getroot()
for child in root:
    print (child.tag + ": " + child.text)

author: Salinger, J. D.
title: The Catcher in the Rye
language: English
publish_date: 1951-07-16
publisher: Little, Brown and Company
isbn: 0-316-76953-3
description: A story about a few important days in the life of Holden Caulfield
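
One caveat: the *text* property is `None` for an element with no text content, and concatenating `None` with a string raises a `TypeError`. A minimal sketch of one way to guard against that, using a made-up fragment (not *book.xml*) with an empty `<isbn/>` element:

```python
from lxml import etree

# Hypothetical fragment: <isbn/> is empty, so its text is None
sample_book = etree.fromstring(
    "<book><title>The Catcher in the Rye</title><isbn/></book>"
)

lines = []
for child in sample_book:
    # Fall back to an empty string instead of concatenating None
    lines.append(child.tag + ": " + (child.text or ""))

print(lines)
```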


### Building XML data



Let's go back to the *book.xml* example above. As usual, use the *lxml* library to parse the XML and get the root of the tree:



In [22]:
from lxml import etree
xmltree = etree.parse('book.xml')
root = xmltree.getroot()

To create a new XML element, use the *etree.Element()* function:



In [23]:
new_element = etree.Element('genre')  # Element() is a factory function that creates a new Element object
new_element.text = 'Novel'
root.append(new_element)
# root[-1] is the last child of root, i.e. the newly appended element; etree.tostring() serialises it
print(etree.tostring(root[-1], pretty_print=True, encoding='unicode'))

<genre>Novel</genre>



In [24]:
for child in root:
    print(child.tag + ":" + child.text)

author:Salinger, J. D.
title:The Catcher in the Rye
language:English
publish_date:1951-07-16
publisher:Little, Brown and Company
isbn:0-316-76953-3
description:A story about a few important days in the life of Holden Caulfield
genre:Novel


Tip: You can create an entirely new XML tree by constructing the root element:

In [25]:
# start a new XML tree; this rebinds root to a new, empty <book> element
root = etree.Element('book')

You can also create a new element using the *SubElement()* function, which appends it to the given parent:


In [26]:
new_element = etree.SubElement(root, "price")
new_element.text = '23.95'
for e in root: # check whether the new element is added
    print(e.tag, e.text)

price 23.95


Use *insert()* to insert a new element at a specific location:

In [27]:
#an alternative way to add another element into the tree
root.insert(1,etree.Element("country"))
root[1].text = "United States"
print(etree.tostring(root[1],pretty_print=True,encoding='unicode'))

<country>United States</country>



In [28]:
print(etree.tostring(root, pretty_print=True, encoding='unicode'))

<book>
  <price>23.95</price>
  <country>United States</country>
</book>



### Serialising XML data (printing as web content or writing into a file)


You can get the whole XML string by calling *etree.tostring()* with the root of the tree as the first parameter:



In [29]:
output = etree.tostring(root, pretty_print=True, encoding="UTF-8")
for e in root:
   print(e.tag)

price
country


In [30]:
open('output.xml','wb').write(output)

73
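
As an alternative sketch, an *ElementTree* object has its own *write()* method, which opens and closes the file for you and can emit the XML declaration. The example below builds a small tree of its own and writes to a temporary directory (the names `demo_root` and `out_path` are ours), so it does not touch *output.xml*.

```python
import os
import tempfile
from lxml import etree

demo_root = etree.Element("book")
etree.SubElement(demo_root, "price").text = "23.95"

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "output.xml")
    # Wrap the root in an ElementTree; write() handles opening and closing the file
    etree.ElementTree(demo_root).write(
        out_path, pretty_print=True, xml_declaration=True, encoding="UTF-8"
    )
    # Read the bytes back to inspect what was written
    with open(out_path, "rb") as f:
        written = f.read()

print(written.decode("utf-8"))
```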

### Exercise 2

Write Python code to load the file "book.xml", change the ISBN to "Unknown", and then write the result to "book-new.xml".

In [31]:
#insert answer to 2 here
xmlBook = etree.parse('book.xml')

In [32]:
rootBook = xmlBook.getroot()
rootBook.find('isbn').text = 'Unknown'

In [33]:
for child in rootBook:
    print(child.tag + ':' + child.text)

author:Salinger, J. D.
title:The Catcher in the Rye
language:English
publish_date:1951-07-16
publisher:Little, Brown and Company
isbn:Unknown
description:A story about a few important days in the life of Holden Caulfield


In [34]:
new_book = etree.tostring(rootBook, pretty_print=True, encoding ='UTF-8')
open('book-new.xml', 'wb').write(new_book)

346

In [35]:
type(new_book)

bytes

## JSON

Python has a built-in json module that allows you to process JSON files. You can find out more about it by reading [its page at python.org](https://docs.python.org/3/library/json.html). W3Schools also provides a good [introductory tutorial](https://www.w3schools.com/python/python_json.asp), while Real Python has a [more comprehensive one](https://realpython.com/python-json/).

Below you can see a sample JSON document, stored as a Python string, containing some information about a book.

In [36]:
str_json = '''
{
"id": "book001",
"author": "Salinger, J. D.",
"title": "The Catcher in the Rye",
"price": "44.95",
"language": "English",
"publish_date": "1951-07-16",
"publisher": "Little, Brown and Company",
"isbn": "0-316-76953-3",
"description": "A story about a few important days in the life of Holden Caulfield"
}
'''
# a JSON object corresponds to a Python dictionary

In [37]:
type(str_json)

str

Using the *json* library we can manipulate the JSON data as follows.

In [38]:
import json
Data = json.loads(str_json)  # loads: parse JSON from a string
print(type(Data))
print(Data["price"])

# modify any field
Data["isbn"] = "Unknown"

# save to a JSON file
with open('book_test.json', 'w') as f:
    json.dump(Data, f, indent=2)  # dump: write a Python object to a file as JSON

# load from a JSON file
with open('book_test.json') as f:
    Data = json.load(f)  # load: parse JSON from a file


<class 'dict'>
44.95
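
The file-based `dump`/`load` pair has a string-based counterpart, `dumps`/`loads`. A minimal sketch of a round trip through a string, using a small made-up record (`record` is our name, not part of the workshop data):

```python
import json

record = {"id": "book001", "price": "44.95", "language": ["English"]}

# dumps() serialises to a str; loads() parses a str back into Python objects
as_text = json.dumps(record, indent=2)
restored = json.loads(as_text)

print(type(as_text).__name__, restored == record)
# JSON arrays come back as Python lists
print(type(restored["language"]).__name__)
```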


### Exercise 3
Add Spanish and German to the JSON file above as two extra languages represented as an array. Save this file as book2.json. Validate it on JSONLint.

In [39]:
Data['language'] = ['English', 'Spanish', 'German']

In [40]:
Data
# print(type(Data)) would show dict: the JSON object is already a Python dictionary

{'id': 'book001',
 'author': 'Salinger, J. D.',
 'title': 'The Catcher in the Rye',
 'price': '44.95',
 'language': ['English', 'Spanish', 'German'],
 'publish_date': '1951-07-16',
 'publisher': 'Little, Brown and Company',
 'isbn': 'Unknown',
 'description': 'A story about a few important days in the life of Holden Caulfield'}

In [41]:
with open('book2.json', 'w') as f:
    json.dump(Data, f, indent=2)  # json.dumps(Data, indent=2) would instead return the JSON text as a string

In [42]:
# load and check the answer
with open('book2.json') as f:
    DataLang = json.load(f)    
DataLang

{'id': 'book001',
 'author': 'Salinger, J. D.',
 'title': 'The Catcher in the Rye',
 'price': '44.95',
 'language': ['English', 'Spanish', 'German'],
 'publish_date': '1951-07-16',
 'publisher': 'Little, Brown and Company',
 'isbn': 'Unknown',
 'description': 'A story about a few important days in the life of Holden Caulfield'}
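
Besides an online tool such as JSONLint, a quick local check is to try parsing the text with `json.loads`, which raises `json.JSONDecodeError` on malformed input. The helper name `is_valid_json` below is hypothetical, and this is only a syntax check, not a check of the content.

```python
import json

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

good = '{"language": ["English", "Spanish", "German"]}'
bad = "{'language': ['English', 'Spanish', 'German']}"  # single quotes are not valid JSON

print(is_valid_json(good), is_valid_json(bad))
```

The second string is what you get by copying the notebook's dictionary display rather than the saved file, which is a common reason for a validator to reject the input.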

### Exercise 4 (If you have time)
Now modify the publish_date field. Make it an array of two objects, each with the
properties edition (first, second) and date (1951-07-16, 1979-01-01) respectively. Save
this file as book3.json.

In [43]:
Data["publish_date"] = [{"edition": "first", "date": "1951-07-16"}, {"edition": "second", "date": "1979-01-01"}]

In [44]:
Data

{'id': 'book001',
 'author': 'Salinger, J. D.',
 'title': 'The Catcher in the Rye',
 'price': '44.95',
 'language': ['English', 'Spanish', 'German'],
 'publish_date': [{'edition': 'first', 'date': '1951-07-16'},
  {'edition': 'second', 'date': '1979-01-01'}],
 'publisher': 'Little, Brown and Company',
 'isbn': 'Unknown',
 'description': 'A story about a few important days in the life of Holden Caulfield'}

In [45]:
with open('book3.json', 'w') as f:
    json.dump(Data, f, indent=4)
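
Once the data is nested like this, values are reached by chaining list indices and dictionary keys. A minimal sketch, parsing an inline string with the same shape as the `publish_date` field rather than reading *book3.json* (the names `book_json` and `dates_by_edition` are ours):

```python
import json

book_json = '''
{
  "title": "The Catcher in the Rye",
  "publish_date": [
    {"edition": "first", "date": "1951-07-16"},
    {"edition": "second", "date": "1979-01-01"}
  ]
}
'''
book = json.loads(book_json)

# Index into the list first, then look up a key in the nested object
second_date = book["publish_date"][1]["date"]

# Build an edition -> date lookup from the list of objects
dates_by_edition = {entry["edition"]: entry["date"] for entry in book["publish_date"]}

print(second_date, dates_by_edition["first"])
```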