### Using lxml to read XML data



We will use the *lxml* Python package. *lxml* provides several APIs (Application Programming Interfaces) for working with XML data. The first is the ElementTree API, which lets us access XML data as a tree-like structure.

As with any other Python package, you need an import statement to load it:

In [1]:
from lxml import etree


To load an XML file and represent it as a tree in memory, you need to parse it. The *etree.parse()* function parses the XML file whose name is passed in as a parameter. We first load the file that you created in question 2.

In [2]:
xmltree = etree.parse("royal2.xml")

The *parse()* function returns an XML *ElementTree* object, which represents the whole XML tree. Each node in the tree is translated into an *Element* object.

Use the *getroot()* method of an *ElementTree* object to get the root element of the XML tree. You can print the XML tag of an element using its *tag* property.

In [3]:
root = xmltree.getroot()
print (root.tag)

queen
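
If the XML is already in memory rather than in a file, the same tree can be built without touching the disk. The sketch below is only an illustration: the inline document is a hypothetical stand-in for *royal2.xml* (the real file is not reproduced here), and `io.BytesIO` is used to give *etree.parse()* a file-like object.

```python
import io
from lxml import etree

# Hypothetical stand-in for royal2.xml; the real file may contain more elements.
xml_bytes = b"""<queen title="Queen Elizabeth II">
  <prince title="Charles, Prince of Wales"/>
  <princess title="Anne, Princess Royal"/>
</queen>"""

# etree.parse() accepts a file name or a file-like object and returns an ElementTree.
tree = etree.parse(io.BytesIO(xml_bytes))
root_from_tree = tree.getroot()

# etree.fromstring() parses a string or bytes and returns the root Element directly.
root_from_string = etree.fromstring(xml_bytes)

print(root_from_tree.tag)     # queen
print(root_from_string.tag)   # queen
print(len(root_from_tree))    # 2 (number of child elements)
```

*etree.fromstring()* skips the *ElementTree* wrapper, so no call to *getroot()* is needed.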


### Traversing the XML Tree

The following sections describe various methods for traversing the XML tree.

To visit all of the children of an element, you can iterate over the XML *Element* itself:

In [4]:
for e in root:
    print (e.tag)

prince
princess
prince
prince


You can use indexing to access the children of an element:


In [5]:
oldest_prince = root[0]
#print(type(oldest_prince))
print (oldest_prince.get("title"))

Charles, Prince of Wales
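
Because an *Element* behaves much like a Python list of its children, `len()`, negative indices and slices work too. A minimal sketch, assuming a hypothetical inline document modelled on the output shown above rather than the real *royal2.xml*:

```python
from lxml import etree

# Hypothetical stand-in for royal2.xml, modelled on the output shown above.
root = etree.fromstring(
    '<queen>'
    '<prince title="Charles, Prince of Wales"/>'
    '<princess title="Anne, Princess Royal"/>'
    '<prince title="Andrew, Duke of York"/>'
    '<prince title="Edward, Earl of Wessex"/>'
    '</queen>'
)

child_count = len(root)                     # number of direct children
last_title = root[-1].get("title")          # negative indices count from the end
middle_tags = [e.tag for e in root[1:3]]    # a slice gives a list of Elements

print(child_count)    # 4
print(last_title)     # Edward, Earl of Wessex
print(middle_tags)    # ['princess', 'prince']
```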


The *find()* method returns only the first matching child.



In [6]:
the_first_child_with_prince_tag = root.find("prince")
print (the_first_child_with_prince_tag.get('title'))

Charles, Prince of Wales
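
Two related details are worth knowing: *findall()* returns every matching child as a list, and *find()* returns `None` when nothing matches. The sketch below illustrates both on a hypothetical inline document (not the real *royal2.xml*); the `duke` tag is made up purely to show the no-match case.

```python
from lxml import etree

# Hypothetical stand-in for royal2.xml, modelled on the output shown above.
root = etree.fromstring(
    '<queen>'
    '<prince title="Charles, Prince of Wales"/>'
    '<princess title="Anne, Princess Royal"/>'
    '<prince title="Andrew, Duke of York"/>'
    '<prince title="Edward, Earl of Wessex"/>'
    '</queen>'
)

first_prince = root.find("prince")      # first matching direct child
all_princes = root.findall("prince")    # list of every matching direct child
missing = root.find("duke")             # no such child, so None is returned

print(first_prince.get("title"))   # Charles, Prince of Wales
print(len(all_princes))            # 3
print(missing is None)             # True
```

Checking for `None` before calling *get()* avoids an `AttributeError` when a tag is absent.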


The *iterchildren()* method allows you to iterate over only those children that have a particular tag:

In [7]:
for child in root.iterchildren(tag="prince"):
    print (child.get('title'))

Charles, Prince of Wales
Andrew, Duke of York
Edward, Earl of Wessex


There is also an *iterdescendants()* method, which iterates over all descendants of a particular node.
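
The difference between the two methods is easiest to see side by side. This is a rough sketch on a hypothetical two-level document (assumed for illustration, not copied from *royal2.xml*): *iterchildren()* stops at the direct children, while *iterdescendants()* also walks into the grandchildren, in document order.

```python
from lxml import etree

# Hypothetical two-level family tree, assumed for illustration only.
root = etree.fromstring(
    '<queen title="Queen Elizabeth II">'
    '<prince title="Charles, Prince of Wales">'
    '<prince title="Prince William of Wales"/>'
    '<prince title="Prince Henry of Wales"/>'
    '</prince>'
    '<princess title="Anne, Princess Royal"/>'
    '</queen>'
)

# Direct children only.
child_titles = [e.get("title") for e in root.iterchildren(tag="prince")]

# Children, grandchildren and so on, in document order.
descendant_titles = [e.get("title") for e in root.iterdescendants(tag="prince")]

print(child_titles)
# ['Charles, Prince of Wales']
print(descendant_titles)
# ['Charles, Prince of Wales', 'Prince William of Wales', 'Prince Henry of Wales']
```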

## Exercise 5a)

Using the *royal2.xml*:

i) Write Python code to get the title attribute of the queen's grandsons.

ii) Write Python code to get the full title of the only princess in the family tree.

In [8]:
#insert answer to 5a) here

from lxml import etree  # import the library
xmltree = etree.parse("royal2.xml")
root = xmltree.getroot()

# i) Get the title attribute of the queen's grandsons.
for child in root:  # iterate over the princes and the princess under the queen
    for grandson in child.iterchildren(tag="prince"):
        print (grandson.get('title'))

# ii) Get the full title of the only princess in the family tree.
the_only_princess = root.find("princess")
print (the_only_princess.get('title'))

Prince William of Wales
Prince Henry of Wales
Anne, Princess Royal


### Accessing XML attributes


You can access the XML attributes of an element using the *get()* method
or the *attrib* property of the element.



In [9]:
print (root.attrib)
print (root.get("title"))


{'title': 'Queen Elizabeth II', 'marriedTo': 'Philip, Duke of Edinburgh'}
Queen Elizabeth II
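
The *attrib* property behaves like a dictionary, *get()* accepts a default for attributes that are absent, and *set()* adds or overwrites an attribute. A minimal sketch on a hypothetical one-element document; the `born` and `checked` attribute names are made up for illustration.

```python
from lxml import etree

# Hypothetical root element, modelled on the attribute output shown above.
root = etree.fromstring(
    '<queen title="Queen Elizabeth II" marriedTo="Philip, Duke of Edinburgh"/>'
)

title = root.get("title")              # value of an existing attribute
born = root.get("born", "unknown")     # default is returned if the attribute is absent

root.set("checked", "yes")             # hypothetical attribute, added for illustration

print(title)                  # Queen Elizabeth II
print(born)                   # unknown
print(sorted(root.attrib))    # ['checked', 'marriedTo', 'title']
```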


### Accessing XML text


The next example uses *book.xml*, which differs from *royal2.xml* in that
each element has some text content. To access the text content of an
element (the text between its start and end tags), use the *text* property
of that element:

In [10]:
from lxml import etree
xmltree = etree.parse('book.xml')
root = xmltree.getroot()
for child in root:
    print (child.tag + ": " + child.text)

author: Salinger, J. D.
title: The Catcher in the Rye
language: English
publish_date: 1951-07-16
publisher: Little, Brown and Company
isbn: 0-316-76953-3
description: A story about a few important days in the life of Holden Caulfield
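
One caveat: *text* is `None` for an empty element, so concatenating it with a string raises a `TypeError`. The sketch below shows two hedged ways around that, using a hypothetical cut-down version of *book.xml* with an empty `isbn` element; the `price` lookup is included only to show the default value of *findtext()*.

```python
from lxml import etree

# Hypothetical, shortened stand-in for book.xml with one empty element.
book = etree.fromstring(
    "<book>"
    "<title>The Catcher in the Rye</title>"
    "<language>English</language>"
    "<isbn/>"
    "</book>"
)

for child in book:
    # child.text is None for <isbn/>, so fall back to an empty string
    print(child.tag + ": " + (child.text or ""))

# findtext() returns the text of the first matching child,
# or the given default when no such child exists.
language = book.findtext("language")
price = book.findtext("price", default="n/a")

print(language)   # English
print(price)      # n/a
```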


### Building XML data



Let's go back to the *book.xml* example above. As usual, use the *lxml* library to parse the XML and get the root of the tree:



In [11]:
from lxml import etree
xmltree = etree.parse('book.xml')
root = xmltree.getroot()

To create a new XML element, use the *etree.Element()* function:



In [12]:
new_element = etree.Element('genre')
new_element.text = 'Novel'
root.append(new_element)
print(etree.tostring(root[-1],pretty_print=True,encoding='unicode'))   # the last element, the newly appended element


<genre>Novel</genre>



Tip: You can create an entirely new XML tree by constructing the root element:

In [13]:
root = etree.Element('book')

You can also create a new element using the *SubElement()* function, which creates the element and appends it to the given parent in one step:


In [14]:
new_element = etree.SubElement(root, "price")
new_element.text = '23.95'
for e in root: # check whether the new element is added
    print(e.tag)

price
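
To compare the two approaches directly, here is a small sketch that builds the same kind of tree both ways. The `currency` attribute is a hypothetical addition, included only to show that *SubElement()* accepts attributes as keyword arguments.

```python
from lxml import etree

root = etree.Element("book")

# Two-step form: create a detached element, then attach it.
title = etree.Element("title")
title.text = "The Catcher in the Rye"
root.append(title)

# One-step form: SubElement() creates the element and appends it to the parent.
# The currency attribute is hypothetical, for illustration only.
price = etree.SubElement(root, "price", currency="USD")
price.text = "23.95"

tags = [e.tag for e in root]
serialised = etree.tostring(root, encoding="unicode")

print(tags)                    # ['title', 'price']
print(price.getparent().tag)   # book
print(serialised)
# <book><title>The Catcher in the Rye</title><price currency="USD">23.95</price></book>
```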


Use *insert()* to insert a new element at a specific location:

In [15]:
root.insert(1,etree.Element("country"))
root[1].text = "United States"
print(etree.tostring(root[1],pretty_print=True,encoding='unicode'))

<country>United States</country>
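
The index passed to *insert()* is the position the new element will occupy among its siblings. A short sketch with hypothetical element names, showing how the order changes:

```python
from lxml import etree

# Hypothetical tree with two children, built only for this illustration.
root = etree.Element("book")
etree.SubElement(root, "title")
etree.SubElement(root, "price")

root.insert(1, etree.Element("author"))   # lands between title and price
root.insert(0, etree.Element("id"))       # index 0 makes it the first child

order = [e.tag for e in root]
print(order)   # ['id', 'title', 'author', 'price']
```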



### Serialising XML data (printing as web content or writing into a file)


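You can serialise the whole XML tree by calling *etree.tostring()* with the root of the tree as the first parameter. With *encoding="UTF-8"* the result is a bytes object, which is why the file below is opened in binary mode ('wb'):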



In [16]:
output = etree.tostring(root, pretty_print=True, encoding="UTF-8")
for e in root:
    print(e.tag)

price
country


In [17]:
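with open('output.xml', 'wb') as f:  # binary mode, because tostring() returned bytes
    print(f.write(output))           # write() returns the number of bytes written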

73
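
As an alternative to calling *tostring()* and then writing the bytes yourself, an *ElementTree* object can write itself to a file. The sketch below is one possible approach, not the only one; it writes to a temporary directory (an assumption made so the example does not overwrite your own files) and reads the result back to check it.

```python
import os
import tempfile
from lxml import etree

root = etree.Element("book")
etree.SubElement(root, "price").text = "23.95"

# The encoding argument decides whether tostring() returns bytes or str.
as_bytes = etree.tostring(root, encoding="UTF-8")
as_text = etree.tostring(root, encoding="unicode")
print(type(as_bytes).__name__, type(as_text).__name__)   # bytes str

# ElementTree.write() serialises straight to a file.
# A temporary directory is used here purely for illustration.
path = os.path.join(tempfile.mkdtemp(), "output.xml")
etree.ElementTree(root).write(
    path, pretty_print=True, xml_declaration=True, encoding="UTF-8"
)

# Parse the file again to confirm the round trip.
reloaded = etree.parse(path).getroot()
print(reloaded.find("price").text)   # 23.95
```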

## Exercise 5b)

Write Python code to load the file "book.xml", change the ISBN to "Unknown", and then write the result to "book-new.xml".

In [18]:
#insert answer to 5b) here

xmltree = etree.parse("book.xml")
root = xmltree.getroot()
root.find("isbn").text='Unknown'
output = etree.tostring(root, pretty_print=True,encoding="UTF-8")
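with open('book-new.xml', 'wb') as f:  # binary mode, because tostring() returned bytes
    print(f.write(output))             # write() returns the number of bytes written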

346

This is the end of the notebook. Now return to question 6) on the exercise sheet.

## Exercise 7

In your JSON solution, add Spanish and German as two extra languages represented
as an array. Save this file as book2.json. Validate it on JSONLint.

Now modify the publish date parameter. Make this an array of two objects that have
properties of edition (first, second) and date (1951-07-16, 1979-01-01) respectively. Save
this file as book3.json.

In [19]:
str_json = '''
{
"id": "book001",
"author": "Salinger, J. D.",
"title": "The Catcher in the Rye",
"price": "44.95",
"language": "English",
"publish_date": "1951-07-16",
"publisher": "Little, Brown and Company",
"isbn": "0-316-76953-3",
"description": "A story about a few important days in the life of Holden Caulfield"
}
'''

In [20]:
import json
Data = json.loads(str_json)
print(type(Data))
print(Data["price"])

# modify any attribute
Data["isbn"] = "Unknown"

# save Json file
with open('book_test.json', 'w') as f:
    json.dump(Data, f,indent = 2)

# load Json file
with open('book_test.json') as f:
    Data = json.load(f)
    

<class 'dict'>
44.95
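
Note that the price above is printed from a JSON string, so in Python it is a `str`, not a number. The sketch below illustrates how *json.loads()* maps JSON types to Python types; the record and its key names are hypothetical and chosen only to cover each type.

```python
import json

# Hypothetical record, written to show one value of each JSON type.
record = json.loads(
    '{"price_text": "44.95", "price_number": 44.95, '
    '"tags": ["a", "b"], "in_stock": true, "note": null}'
)

# Map each key to the name of the Python type it was decoded into.
types = {key: type(value).__name__ for key, value in record.items()}
print(types)
# {'price_text': 'str', 'price_number': 'float', 'tags': 'list', 'in_stock': 'bool', 'note': 'NoneType'}
```

A JSON object becomes a `dict`, as the `<class 'dict'>` output above shows.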


In [21]:
#insert answer to 7a here

Data["language"] = ["English","Spanish","German"]
with open('book2.json', 'w') as f:
    json.dump(Data, f,indent = 2)
    
    

In [22]:
with open('book2.json') as f:
    Data = json.load(f)

In [23]:
Data

{'id': 'book001',
 'author': 'Salinger, J. D.',
 'title': 'The Catcher in the Rye',
 'price': '44.95',
 'language': ['English', 'Spanish', 'German'],
 'publish_date': '1951-07-16',
 'publisher': 'Little, Brown and Company',
 'isbn': 'Unknown',
 'description': 'A story about a few important days in the life of Holden Caulfield'}
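
Besides an online validator such as JSONLint, you can do a rough syntax check locally, because *json.loads()* raises `json.JSONDecodeError` on malformed input. The helper name `is_valid_json` below is hypothetical, and this sketch only checks syntax, not whether the content matches what the exercise asks for.

```python
import json

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

good = '{"language": ["English", "Spanish", "German"]}'
bad = '{"language": ["English", "Spanish", "German",]}'   # trailing comma is not allowed

print(is_valid_json(good))   # True
print(is_valid_json(bad))    # False
```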

In [24]:
#insert answer to 7b here

obj1= {"edition":"first","date":"1951-07-16"}
obj2= {"edition":"second","date":"1979-01-01"}
Data["publish_date"]=[obj1,obj2]

In [25]:
with open('book3.json', 'w') as f:
    json.dump(Data, f,indent = 2)
with open('book3.json') as f:
    Data = json.load(f)

In [26]:
Data['id']

'book001'
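
Once the publish date is an array of objects, individual values are reached by chaining an index and a key. A minimal sketch, assuming a shortened, hypothetical version of the book3.json content described in the exercise:

```python
import json

# Shortened, hypothetical version of the book3.json structure.
book = json.loads('''
{
  "title": "The Catcher in the Rye",
  "publish_date": [
    {"edition": "first", "date": "1951-07-16"},
    {"edition": "second", "date": "1979-01-01"}
  ]
}
''')

# Index into the list first, then look up the key in the object.
first_date = book["publish_date"][0]["date"]

# Build a lookup table from edition name to date.
date_by_edition = {entry["edition"]: entry["date"] for entry in book["publish_date"]}

print(first_date)                  # 1951-07-16
print(date_by_edition["second"])   # 1979-01-01
```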