# ElementTree XML Tutorial



Extensible Markup Language (XML) files are inherently hierarchical data files, and the most natural way to represent hierarchical data is with a tree. This tutorial introduces the basics of reading and writing XML files using their tree representation.

Before we begin, some background on XML should be noted. XML is a markup language that defines rules for encoding documents in a format that is both human-readable and machine-readable. XML is similar to HTML in that both are tag-based markup languages widely used on the web, but there is a key difference: HTML was designed to display data, while XML was designed to describe and carry it.

Chugging along... ElementTree is part of the Python standard library, so a single line of code is enough to import it. We will also import the minidom library to pretty-print our XML trees. The code cells in this tutorial use Python 2 print statements.

We will first go over how to create an XML file from scratch, and then we will parse a pre-existing XML file using what we learned about the library. Instructions for downloading the data needed for the tutorial appear later on.

In [2]:
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Generating an XML File

Suppose now we want to generate an XML file about popular fighter characters. The root of the tree will be Fighters. The root has subelements (nodes) representing individual characters, and each character in turn has leaf nodes representing its stats. There is no limit on how large a tree can be, but we'll keep it to six characters with two stats each for our example. We'll also have some fun and see who the strongest fighter is! Visually, there are three layers to the tree: a root node, a fighters layer, and a stats layer. Sort of like below:

                        Fighters
    Saitama  Popeye  Cyborg  Chun-Li  Snake   Newb
    HP ATT   HP ATT  HP ATT  HP ATT   HP ATT  HP ATT

In [132]:
#Establish the root of the tree
root = ET.Element('Fighters')

#Create a subelement for each fighter and set the stats
OPM = ET.SubElement(root, 'Saitama')
opmHealth = ET.SubElement(OPM, "Health")
opmHealth.text = "1000"
opmAttack = ET.SubElement(OPM, "Attack")
opmAttack.text = "9999"

pe = ET.SubElement(root, "Popeye")
peHealth = ET.SubElement(pe, "Health")
peHealth.text = "100"
peAttack = ET.SubElement(pe, "Attack")
peAttack.text = "2"

cb = ET.SubElement(root, "Cyborg")
cbHealth = ET.SubElement(cb, "Health")
cbHealth.text = "10"
cbAttack = ET.SubElement(cb, "Attack")
cbAttack.text = "10"

chun = ET.SubElement(root, "Chun-Li")
chunHealth = ET.SubElement(chun, "Health")
chunHealth.text = "100"
chunAttack = ET.SubElement(chun, "Attack")
chunAttack.text = "100"

sk = ET.SubElement(root, "Snake")
skHealth = ET.SubElement(sk, "Health")
skHealth.text = "2"
skAttack = ET.SubElement(sk, "Attack")
skAttack.text = "100"

noob = ET.SubElement(root, "Newb")
noobHealth = ET.SubElement(noob, "Health")
noobHealth.text = "1"
noobAttack = ET.SubElement(noob, "Attack")
noobAttack.text = "1"

#This will be used to help visualize the tree
viz = ET.tostring(root, 'utf-8')
parsed = minidom.parseString(viz)
print parsed.toprettyxml(indent="  ")



<?xml version="1.0" ?>
<Fighters>
  <Saitama>
    <Health>1000</Health>
    <Attack>9999</Attack>
  </Saitama>
  <Popeye>
    <Health>100</Health>
    <Attack>2</Attack>
  </Popeye>
  <Cyborg>
    <Health>10</Health>
    <Attack>10</Attack>
  </Cyborg>
  <Chun-Li>
    <Health>100</Health>
    <Attack>100</Attack>
  </Chun-Li>
  <Snake>
    <Health>2</Health>
    <Attack>100</Attack>
  </Snake>
  <Newb>
    <Health>1</Health>
    <Attack>1</Attack>
  </Newb>
</Fighters>



As you can see from the code above and the output, the XML tree is represented as a hierarchy of elements and subelements. In the code we created a subelement for each character under the root element "Fighters" and then created further subelements holding the stat values for each fighter. For those already familiar with a markup language, you'll notice that the elements in our tree are just tags that we have defined.
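
The cell above repeats the same four lines for every fighter. As a minimal sketch (written for Python 3, and not part of the original notebook), the same tree could also be built from a list in a loop. The helper name `build_fighters` and the variable `fighter_stats` are made up for illustration; the stat values are copied from the example above.

```python
import xml.etree.ElementTree as ET

# (name, health, attack) for each fighter, copied from the example above.
fighter_stats = [
    ("Saitama", 1000, 9999),
    ("Popeye", 100, 2),
    ("Cyborg", 10, 10),
    ("Chun-Li", 100, 100),
    ("Snake", 2, 100),
    ("Newb", 1, 1),
]


def build_fighters(stats):
    """Build a Fighters tree from (name, health, attack) tuples (hypothetical helper)."""
    root = ET.Element("Fighters")
    for name, health, attack in stats:
        fighter = ET.SubElement(root, name)
        # Element text has to be a string, so the numbers are converted.
        ET.SubElement(fighter, "Health").text = str(health)
        ET.SubElement(fighter, "Attack").text = str(attack)
    return root


fighters = build_fighters(fighter_stats)
print(len(fighters))  # number of fighter subelements under the root
```

Keeping the data in one list makes it easy to add or remove a fighter without touching the tree-building logic.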

Below we'll take the tree we just made and walk through it with a simple for loop. Iterating over an element yields its direct subelements, so looping over the root gives us each fighter, and we print each one along with its stats.

In [133]:
print "Root of the tree is: " + root.tag

#Accessing the various nodes of the Fighters tree
maxAggScore = 0
for child in root:
    print child.tag
    HP =  child.findtext('Health')
    Att = child.findtext('Attack')
    print HP, Att
    score = int(HP) + int(Att)
    
    #Let's use our new parsing information to determine who the strongest fighter is!
    if score > maxAggScore:
        maxAggScore = score
        name = child.tag
        strongestHP = HP
        strongestAtt = Att
        

print "Strongest Fighter is... " + name 
print "His HP is a whopping... " + strongestHP
print "His Attack is an astounding... " + strongestAtt

Root of the tree is: Fighters
Saitama
1000 9999
Popeye
100 2
Cyborg
10 10
Chun-Li
100 100
Snake
2 100
Newb
1 1
Strongest Fighter is... Saitama
His HP is a whopping... 1000
His Attack is an astounding... 9999
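
The same search can be written more compactly with the built-in `max` and a scoring function. This is only a sketch in Python 3: the two-fighter tree below is a cut-down stand-in for the one built earlier so that the block runs by itself, and `aggregate_score` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the Fighters tree built earlier.
root = ET.fromstring(
    "<Fighters>"
    "<Popeye><Health>100</Health><Attack>2</Attack></Popeye>"
    "<Saitama><Health>1000</Health><Attack>9999</Attack></Saitama>"
    "</Fighters>"
)


def aggregate_score(fighter):
    """Sum of the Health and Attack values stored as text in the element."""
    return int(fighter.findtext("Health")) + int(fighter.findtext("Attack"))


# Iterating over an element yields its direct children, so max() compares fighters.
strongest = max(root, key=aggregate_score)
print("Strongest fighter: " + strongest.tag)  # prints: Strongest fighter: Saitama
```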


# Introducing Attributes


Now that we are a little more familiar with the basic structure of an XML tree, we will introduce a feature called attributes. An attribute is a name/value pair stored inside an element's opening tag, such as id="Penguin", and it is a handy way to label or categorize an element in a form that is both human-readable and machine-readable. (Attributes should not be confused with XML namespaces, which are declared with xmlns and are not covered in this tutorial.) In the example below we will construct an XML tree of Animals. Each animal element carries an id attribute naming the animal, and its subelements describe certain qualities about it.

In [134]:
animals = ET.Element('Animals')

#Set the id attribute for each animal
a1 = ET.SubElement(animals, 'animal', id='Penguin')
a2 = ET.SubElement(animals, 'animal', id='Koala')
a3 = ET.SubElement(animals, 'animal', id='Frog')
a4 = ET.SubElement(animals, 'animal', id='Mouse')

#Initialize the location sub elements for each animal
loc1 = ET.SubElement(a1, 'location')
loc1.text = "Antarctica"
loc2 = ET.SubElement(a2, 'location')
loc2.text = "Australia"
loc3 = ET.SubElement(a3, 'location')
loc3.text = "Swamp"
loc4 = ET.SubElement(a4, 'location')
loc4.text = "Hole In Wall"

#Set the id attribute for the respective food each animal eats
food1 = ET.SubElement(a1, 'food', id='fish')
food2 = ET.SubElement(a2, 'food', id='eucalyptus')
food3 = ET.SubElement(a3, 'food', id='flies')
food4 = ET.SubElement(a4, 'food', id='cheese')

rough_string = ET.tostring(animals, 'utf-8')
reparsed = minidom.parseString(rough_string)
print reparsed.toprettyxml(indent="  ")

<?xml version="1.0" ?>
<Animals>
  <animal id="Penguin">
    <location>Antarctica</location>
    <food id="fish"/>
  </animal>
  <animal id="Koala">
    <location>Australia</location>
    <food id="eucalyptus"/>
  </animal>
  <animal id="Frog">
    <location>Swamp</location>
    <food id="flies"/>
  </animal>
  <animal id="Mouse">
    <location>Hole In Wall</location>
    <food id="cheese"/>
  </animal>
</Animals>
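
The `id='...'` keyword arguments in the cell above are stored as attributes on each element. Here is a minimal Python 3 sketch of the basic attribute operations on a single element. The `diet` and `habitat` attribute names are made up for illustration and do not appear in the Animals tree.

```python
import xml.etree.ElementTree as ET

# Keyword arguments to Element/SubElement become attributes.
animal = ET.Element("animal", id="Penguin")

# set() adds or overwrites an attribute; "diet" is a made-up attribute name.
animal.set("diet", "carnivore")

# get() reads an attribute and accepts a default for missing ones.
animal_id = animal.get("id")
habitat = animal.get("habitat", "unknown")

# attrib exposes all attributes as a dictionary.
print(sorted(animal.attrib.items()))
```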



# Parsing Attributes

Phenomenal! We have now generated an XML tree whose elements carry id attributes. Many documents on the web can be represented this way, as a tree of elements with id fields. The next important step is parsing: we need to know how to use the ElementTree library to walk through an XML tree and retrieve the data we want. The code below shows how, using the Animals tree from the previous example. We loop through every animal element under the root and read its name by calling get on 'id'. Location is stored as element text rather than as an attribute, so we find the tag and return its text. Food is stored as an attribute, so we find the tag and call get on 'id'. The structure we are walking looks like this:

Penguin:
 --> Antarctica
 --> fish

Koala:
 --> Australia
 --> eucalyptus

Frog:
 --> Swamp
 --> flies

Mouse:
 --> Hole In Wall
 --> cheese

In [135]:
for animal in animals.findall('animal'):
    name = animal.get('id')
    loc = animal.find('location').text
    food = animal.find('food').get('id')
    print name, loc, food

Penguin Antarctica fish
Koala Australia eucalyptus
Frog Swamp flies
Mouse Hole In Wall cheese
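
When only one animal is of interest, the loop is not strictly needed: `find` and `findall` accept a limited XPath syntax with predicates on attributes and on child text. A small Python 3 sketch, using a two-animal copy of the tree above so that it runs by itself:

```python
import xml.etree.ElementTree as ET

# Two-animal copy of the Animals tree built earlier.
animals = ET.fromstring(
    "<Animals>"
    '<animal id="Penguin"><location>Antarctica</location><food id="fish"/></animal>'
    '<animal id="Koala"><location>Australia</location><food id="eucalyptus"/></animal>'
    "</Animals>"
)

# [@id='Koala'] selects the animal whose id attribute equals 'Koala'.
koala = animals.find("animal[@id='Koala']")
print(koala.findtext("location"))  # prints: Australia

# [location='Antarctica'] selects animals that have a location child with that text.
antarctic = animals.find("animal[location='Antarctica']")
print(antarctic.get("id"))  # prints: Penguin
```

Both calls return None when nothing matches, so real code should check the result before using it.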


# Altering the Element Tree

It's also possible to find a certain element within a tree and remove it completely. This can be useful if the file is large but we know the specific tag to eliminate. In the example below I want to remove Saitama, and everything about him, from the fighters tree. I find the element under the root and remove it in place. In the output it can be seen that Saitama is now gone. The find method is very useful here: it returns the first direct child that matches the given tag (or None if there is no match), so we don't have to write our own loop over the tree to locate the value we want to delete.

In [98]:
elem = root.find('Saitama')
root.remove(elem)

viz = ET.tostring(root, 'utf-8')
parsed = minidom.parseString(viz)
print parsed.toprettyxml(indent="  ")

<?xml version="1.0" ?>
<Fighters>
  <Popeye>
    <Health>100</Health>
    <Attack>2</Attack>
  </Popeye>
  <Cyborg>
    <Health>10</Health>
    <Attack>10</Attack>
  </Cyborg>
  <Chun-Li>
    <Health>100</Health>
    <Attack>100</Attack>
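  </Chun-Li>
  <Snake>
    <Health>2</Health>
    <Attack>100</Attack>
  </Snake>
  <Newb>
    <Health>1</Health>
    <Attack>1</Attack>
  </Newb>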
</Fighters>
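
Removing several elements at once needs a little care: removing children while looping over the element itself can cause entries to be skipped, so it is safer to loop over a copy. The sketch below (Python 3) drops every fighter whose Health is under a threshold. The threshold of 50 is an arbitrary value picked for illustration, and the tree is a cut-down stand-in for the Fighters tree.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the Fighters tree.
root = ET.fromstring(
    "<Fighters>"
    "<Popeye><Health>100</Health><Attack>2</Attack></Popeye>"
    "<Cyborg><Health>10</Health><Attack>10</Attack></Cyborg>"
    "<Snake><Health>2</Health><Attack>100</Attack></Snake>"
    "<Newb><Health>1</Health><Attack>1</Attack></Newb>"
    "</Fighters>"
)

MIN_HEALTH = 50  # arbitrary cut-off chosen for this sketch

# list(root) takes a snapshot of the children, so removing from root
# does not disturb the loop.
for fighter in list(root):
    if int(fighter.findtext("Health")) < MIN_HEALTH:
        root.remove(fighter)

remaining = [fighter.tag for fighter in root]
print(", ".join(remaining))  # prints: Popeye
```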



# Writing XML Files with the ElementTree Object

We have now covered the Element object (class xml.etree.ElementTree.Element) in the ElementTree XML API. I will now discuss another important object in the API, the ElementTree object (class xml.etree.ElementTree.ElementTree). The two sound similar, and you may wonder why we need the second one, but there is a distinction. An Element represents a single node: it implements the methods and attributes for one element and gives access to its subelements. We can represent a whole hierarchy by its root Element, as we have done so far, but an Element has no methods of its own for reading from or writing to a file.

The ElementTree object wraps a root element and represents the entire element hierarchy as a document. It also adds support for serialization to and from standard XML, which I will demonstrate.


In [119]:
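#Create an empty tree, then attach the animals element as its root
tree = ET.ElementTree()
tree._setroot(animals)
print tree.getroot()

#Serialize the whole hierarchy to a file in the current working directory
tree.write("output.xml")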

<Element 'Animals' at 0x7fd93118b350>


In the block of code above we created a new instance of the ElementTree object called tree. After the first line the instance is empty and has no elements yet. We then take the animals tree from our previous example and set it as the root; passing the element to the constructor, as in ET.ElementTree(animals), does the same thing in one step. We can verify the result by calling getroot() on tree, which returns the root element. The best part about the ElementTree object is that it can write itself to a file. Here we save the hierarchy as 'output.xml', which is created in the current working directory.
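
To check that a written file really contains the tree, it can be parsed straight back. The following Python 3 sketch writes a one-animal tree with an explicit encoding and XML declaration and then reloads it. Writing into a temporary directory is a choice made here so the example does not clutter the working directory; it is not something the tutorial requires.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# One-animal stand-in for the Animals tree.
animals = ET.Element("Animals")
ET.SubElement(animals, "animal", id="Penguin")

# Passing the root to the constructor wraps it in an ElementTree in one step.
tree = ET.ElementTree(animals)

# Write into a temporary directory (an assumption of this sketch).
path = os.path.join(tempfile.mkdtemp(), "output.xml")
tree.write(path, encoding="utf-8", xml_declaration=True)

# Parse the file back and confirm the content survived the round trip.
reloaded = ET.parse(path).getroot()
print(reloaded.tag)                       # prints: Animals
print(reloaded.find("animal").get("id"))  # prints: Penguin
```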

# Parsing Real Data Set

Great! We have covered a lot of basics, such as how the tree hierarchy works, initializing our own trees, and creating new XML files. Now let's turn to a real-world scenario: parsing a simple pre-existing XML file. We will use a small example that incorporates most of what we have discussed so far without being too convoluted.

We will be using data on L.A. Zoo attendance for five fiscal years (2008-09 through 2012-13). Go to https://data.lacity.org/api/views/3gwn-arjr/rows.xml?accessType=DOWNLOAD, then right click and save the page as an XML file in the same directory as this notebook. I called mine 'LA.xml'.


In [100]:
LA = ET.parse('LA.xml')
print type(LA)
laRoot = LA.getroot()
print laRoot

<class 'xml.etree.ElementTree.ElementTree'>
<Element 'response' at 0x7fd9311ed490>


The ElementTree XML API has a built-in function called parse, which was used above. ET.parse(file_path) reads the file and returns an xml.etree.ElementTree.ElementTree object, as the first line of output shows. Its getroot() method then returns the root element of the tree, which here is 'response'. Phenomenal! We've covered a lot of basics on ElementTree objects so far, and now we have generated one from an XML file found on the web. Now let's take some time to access individual elements in this tree.

In [107]:
for child in laRoot.iter('row'):
    print child.attrib
    att = child.findtext('attendance')
    fisc = child.findtext('fiscal_year')
    print "Attendance in year " + str(fisc) + " was " + str(att) 

{}
Attendance in year None was None
{'_address': 'http://data.lacity.org/resource/_3gwn-arjr/1', '_id': '1', '_uuid': '4936580F-0829-4C96-B389-598BB53D3862', '_position': '1'}
Attendance in year 2012-13 was 1506274
{'_address': 'http://data.lacity.org/resource/_3gwn-arjr/2', '_id': '2', '_uuid': 'D74A48AD-64DA-4C9B-8A9F-0695D19919F6', '_position': '2'}
Attendance in year 2011-12 was 1660450
{'_address': 'http://data.lacity.org/resource/_3gwn-arjr/3', '_id': '3', '_uuid': 'B6643375-0C48-45C1-9949-2585A62AFB57', '_position': '3'}
Attendance in year 2010-11 was 1543232
{'_address': 'http://data.lacity.org/resource/_3gwn-arjr/4', '_id': '4', '_uuid': 'C6B3EE93-8332-4ADE-A0D1-92F053DE15E9', '_position': '4'}
Attendance in year 2009-10 was 1459080
{'_address': 'http://data.lacity.org/resource/_3gwn-arjr/5', '_id': '5', '_uuid': 'EA611366-347D-46F5-9BC0-10834CE063FD', '_position': '5'}
Attendance in year 2008-09 was 1556162


Going through the XML tree of L.A. Zoo attendance data, we parsed every row element in the tree. Note that iter('row') searches the whole tree, not just the direct children of the root. That is most likely why the first lines of output are an empty attribute dictionary followed by None values: the file appears to wrap the data rows in an outer row element, which matches the tag but has no attributes, attendance, or fiscal_year of its own. Inside the for loop we read both the fiscal year and the attendance with the child.findtext(tag) method and print them in a readable format. Although this was a very small example of data extracted from a real source, the same skills apply to other XML files. However large or convoluted a file is, treating it as a tree gives us a systematic way to get at the data.
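
One way to avoid the stray None line is to ask only for the row elements nested inside the outer row. The sketch below is Python 3 and runs on a small inline sample rather than the downloaded file. The nesting of the sample is an assumption inferred from the output above, and only two of the five rows are reproduced.

```python
import xml.etree.ElementTree as ET

# Miniature of the downloaded file. The outer/inner <row> nesting is assumed
# from the output above; the values are taken from that output.
sample = """
<response>
  <row>
    <row _id="1"><fiscal_year>2012-13</fiscal_year><attendance>1506274</attendance></row>
    <row _id="2"><fiscal_year>2011-12</fiscal_year><attendance>1660450</attendance></row>
  </row>
</response>
""".strip()

la_root = ET.fromstring(sample)

# "./row/row" matches only rows that sit inside the outer wrapper row,
# so the wrapper itself is skipped.
attendance_by_year = {}
for row in la_root.findall("./row/row"):
    attendance_by_year[row.findtext("fiscal_year")] = int(row.findtext("attendance"))

print(attendance_by_year["2012-13"])  # prints: 1506274
```

Converting the attendance to int at this point also makes the values ready for arithmetic, such as comparing years.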

For a little extra practice, we will now cover another example with a much larger XML file. As before, download the XML file from https://data.austintexas.gov/api/views/ykw4-j3aj/rows.xml?accessType=DOWNLOAD and save it next to the notebook. The code below expects it to be named 'dog.xml'.

This file lists dangerous dog incidents and their locations. We will parse it for the name of each dangerous dog's owner, the address, the zip code, and the description of the dog.

In [123]:
#Open the file and grab the root
dogs = ET.parse('dog.xml')
dogRoot = dogs.getroot()

#Iterate through the subelements of the root node
for child in dogRoot.iter('row'):
    
    #Retrieve all fields we want in each subelement.
    firstName = child.findtext('first_name')
    lastName = child.findtext('last_name')
    address = child.findtext('address')
    zipcode = child.findtext('zip_code')
    report = child.findtext('description_of_dog')
    print firstName, lastName
    print address
    print zipcode
    print report



None None
None
None
None
Lorena Zuniga
3415 Sweetgum Trc
78713
“Mulligan,” neutered male, Brindle Bullmastiff
Maria Davila
4420 Dovemeadow Dr
78744
“Tiny,” male, tan and white Boxer mix
Matthew  Rafacz
7400 Espira Drive 
78739
"Charlie" neutered male, black and white Labrador Retriever mix 
Jeff Crawford
9321 Bavaria Ln.
78749
"Nala" spayed female, white and brown brindle Pit Bull mix
Katherine  Maloney
11504 Murcia Dr
78759
“Lexie,” female, white and black Pit Bull  
Jack Barnett
13101 Winding Creek Rd
78736
"Holly" Spayed female, white Labrador/Pitbull mix
Carla Ward
7128 Mumruffin Ln
78754
“Lincoln,” male, fawn and white Pit Bull Terrier
Melissa Spellmann
2815 Oak Ridge Dr
78669
“Sparkles,” spayed female, Brindle Plott Hound mix
Ruth Delong-Pyron
903 Vincent Place 
78660
"Missy," Spayed Female, red/white Pitbull mix
Ronald Vasey
4704 Sunridge Ct
78741
“Rita,” female, brown Australian Shepherd  
Timothy  LeBlanc
5931 Cape Coral Dr
78746
"Miles Davis," female, gold/white Golden Retrie

# Summary and references

This tutorial highlighted the major points of using the ElementTree XML API: building trees, reading id fields, removing elements, writing files, and parsing existing data. Referenced data and more details are included below. XML files are everywhere, so use this tutorial along with the resources below to go explore the world of Extensible Markup Language files!

1. https://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.find
2. https://data.lacity.org/api/views/3gwn-arjr/rows.xml?accessType=DOWNLOAD
3. https://docs.python.org/2/library/xml.dom.minidom.html