## ElementTree

`ElementTree` from the standard library and `lxml` are the most prevalent tools in the Python world for processing XML.

The ElementTree library is part of the Python standard library, in the `xml.etree.ElementTree` module.

In [1]:
import xml.etree.ElementTree as etree

The primary entry point for the ElementTree library is the `parse()` function, which takes a filename or a file-like object and parses the entire document at once.
If memory is tight, there are ways to parse an XML document incrementally instead.

In [2]:
tree = etree.parse('feed.xml')
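
As a rough illustration of the incremental option mentioned above, the sketch below uses `etree.iterparse()`, which yields elements as the parser completes them. The inline `SAMPLE` document and the `ATOM` shortcut are made up for this example; a real filename could be passed to `iterparse()` in place of the `BytesIO` object.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical stand-in for a large feed file (not the real feed.xml).
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>dive into mark</title>
  <entry><title>First</title></entry>
  <entry><title>Second</title></entry>
</feed>"""

ATOM = '{http://www.w3.org/2005/Atom}'

titles = []
# By default iterparse() reports only "end" events, i.e. it yields
# each element once its closing tag has been read.
for event, elem in etree.iterparse(io.BytesIO(SAMPLE)):
    if elem.tag == f'{ATOM}entry':
        titles.append(elem.findtext(f'{ATOM}title'))
        # Discard the entry's children once processed to free memory.
        elem.clear()

print(titles)
```

Calling `clear()` on each processed entry is what keeps memory use low; without it, the full tree would still be built in the background.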

The `parse()` function returns an object which represents the entire document. 
This is not the root element. 
To get a reference to the root element, call the `getroot()` method.

In [3]:
root = tree.getroot()

The root element is the `feed` element in the `http://www.w3.org/2005/Atom` namespace.
The string representation of this object reinforces an important point:
an XML element is a combination of its namespace and its tag name (also called the local name).
Every element in this document is in the Atom namespace,
so the root element is represented as `{http://www.w3.org/2005/Atom}feed`.

In [4]:
root

<Element '{http://www.w3.org/2005/Atom}feed' at 0x1094f10e0>

*ElementTree represents XML elements as `{namespace}localname`. You’ll see and use this format in multiple places in the ElementTree API.*

In [5]:
root.tag

'{http://www.w3.org/2005/Atom}feed'
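
Because the tag is a plain string in `{namespace}localname` form, it can be taken apart with ordinary string methods. The sketch below assumes a small inline document, and `split_tag` is a hypothetical helper written for this example, not part of ElementTree.

```python
import xml.etree.ElementTree as etree

def split_tag(tag):
    """Split '{namespace}localname' into (namespace, localname).

    Hypothetical helper, not an ElementTree function. Tags without
    a namespace are returned with an empty namespace string.
    """
    if tag.startswith('{'):
        namespace, _, localname = tag[1:].partition('}')
        return namespace, localname
    return '', tag

# Made-up one-line document standing in for the feed.
root = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>'
)
namespace, localname = split_tag(root.tag)
print(namespace)  # http://www.w3.org/2005/Atom
print(localname)  # feed
```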

The “length” of the root element is the number of child elements.

In [6]:
len(root)

8

An element is iterable, so you can loop over it to visit its child elements. The iteration only includes direct children, not grandchildren or deeper descendants.

In [7]:
for child in root:
    print(child)

<Element '{http://www.w3.org/2005/Atom}title' at 0x1094f1180>
<Element '{http://www.w3.org/2005/Atom}subtitle' at 0x1094f1220>
<Element '{http://www.w3.org/2005/Atom}id' at 0x1094f1310>
<Element '{http://www.w3.org/2005/Atom}updated' at 0x1094f13b0>
<Element '{http://www.w3.org/2005/Atom}link' at 0x1094f14f0>
<Element '{http://www.w3.org/2005/Atom}entry' at 0x1094f1590>
<Element '{http://www.w3.org/2005/Atom}entry' at 0x1094f1d60>
<Element '{http://www.w3.org/2005/Atom}entry' at 0x1094f52c0>
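
To contrast direct children with all descendants, the following sketch compares plain iteration with the element's `iter()` method, which walks the whole subtree in document order, starting with the element itself. The inline `SAMPLE` document and the `local` helper are assumptions made for this example.

```python
import xml.etree.ElementTree as etree

# Made-up miniature feed with one level of nesting below <entry>.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>dive into mark</title>
  <entry>
    <title>First</title>
    <author><name>Mark</name></author>
  </entry>
</feed>"""

root = etree.fromstring(SAMPLE)

def local(tag):
    # Hypothetical helper: drop the '{namespace}' prefix for readability.
    return tag.rpartition('}')[2]

# Plain iteration visits direct children only.
direct = [local(child.tag) for child in root]
# iter() visits the element itself and every descendant.
everything = [local(elem.tag) for elem in root.iter()]

print(direct)      # ['title', 'entry']
print(everything)  # ['feed', 'title', 'entry', 'title', 'author', 'name']
```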


## Attributes Are Dictionaries

Once you have a reference to a specific element, you can easily get its attributes as a Python dictionary.

In [8]:
root.attrib

{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}

In [9]:
root[4]

<Element '{http://www.w3.org/2005/Atom}link' at 0x1094f14f0>

In [10]:
root[4].attrib

{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/'}

In [11]:
root[3]

<Element '{http://www.w3.org/2005/Atom}updated' at 0x1094f13b0>

In [12]:
# The updated element has no attributes, 
# so its .attrib is just an empty dictionary.
root[3].attrib

{}
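
Besides indexing into `.attrib`, elements offer a `get()` method that behaves like `dict.get()`, which is handy when an attribute may be absent. The sketch below assumes a small inline document whose values are sample data only.

```python
import xml.etree.ElementTree as etree

# Made-up fragment loosely modelled on the feed; values are sample data.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
  <updated>2009-03-27T21:56:07Z</updated>
</feed>"""

root = etree.fromstring(SAMPLE)
link, updated = root[0], root[1]

href = link.get('href')                # attribute present
missing = updated.get('href')          # attribute absent: None
fallback = updated.get('href', 'n/a')  # attribute absent: default value

# Namespaced attributes use the same {namespace}localname form as tags.
XML_NS = '{http://www.w3.org/XML/1998/namespace}'
lang = root.get(f'{XML_NS}lang')

print(href, missing, fallback, lang)
```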

## Searching For Nodes Within An XML Document

### findall()

Each element, including the root element and every child element, has a `findall()` method. It returns a list of all matching elements among the element’s direct children.

In [13]:
tree

<xml.etree.ElementTree.ElementTree at 0x1094e3b20>

In [14]:
# We will need to use the namespace a lot, so we make this shortcut
namespace = '{http://www.w3.org/2005/Atom}'

In [15]:
root.findall(f'{namespace}entry')

[<Element '{http://www.w3.org/2005/Atom}entry' at 0x1094f1590>,
 <Element '{http://www.w3.org/2005/Atom}entry' at 0x1094f1d60>,
 <Element '{http://www.w3.org/2005/Atom}entry' at 0x1094f52c0>]

In [16]:
root.tag

'{http://www.w3.org/2005/Atom}feed'

In [17]:
# This query returns an empty list because the root
# element 'feed' does not have any child element 'feed'
root.findall(f'{namespace}feed')

[]

In [18]:
# This query only finds direct children. The author nodes are nested,
# therefore this query returns an empty list
root.findall(f'{namespace}author')

[]

For convenience, the `tree` object (returned from the `etree.parse()` function) has several methods that mirror the methods on the root element. The results are the same as if you had called the `tree.getroot().findall()` method.

In [19]:
tree.findall(f'{namespace}entry')

[<Element '{http://www.w3.org/2005/Atom}entry' at 0x1094f1590>,
 <Element '{http://www.w3.org/2005/Atom}entry' at 0x1094f1d60>,
 <Element '{http://www.w3.org/2005/Atom}entry' at 0x1094f52c0>]

In [20]:
tree.findall(f'{namespace}author')

[]
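
As an alternative to the f-string shortcut, `findall()` (like `find()`) accepts an optional second argument that maps prefixes to namespace URIs. The sketch below assumes an inline sample document; the `atom` prefix is an arbitrary name chosen for this example and does not have to appear in the XML.

```python
import xml.etree.ElementTree as etree

# Made-up miniature feed.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First</title></entry>
  <entry><title>Second</title></entry>
</feed>"""

root = etree.fromstring(SAMPLE)

# Prefix-to-URI mapping; 'atom' is just a local alias.
ns = {'atom': 'http://www.w3.org/2005/Atom'}

with_prefix = root.findall('atom:entry', ns)
with_braces = root.findall('{http://www.w3.org/2005/Atom}entry')

print(len(with_prefix))            # 2
print(with_prefix == with_braces)  # True: both queries match the same elements
```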

### find()

The `find()` method takes an ElementTree query and returns the first matching element. This is useful when you expect only one match, or when there are multiple matches and you only care about the first one.

In [21]:
entries = tree.findall(f'{namespace}entry')
len(entries)

3

In [22]:
# Get the first title. (We happen to know that each entry has only one title.)
title_element = entries[0].find(f'{namespace}title')
title_element.text

'Dive into history, 2009 edition'

There are no elements in this entry named `foo`, so this returns `None`.

In [23]:
foo_element = entries[0].find(f'{namespace}foo')
foo_element

In [24]:
type(foo_element)

NoneType

**Beware:** In a boolean context, ElementTree element objects evaluate to `False` if they contain no children (i.e. if `len(element)` is 0). This means that `if element.find('...')` is not testing whether the `find()` method found a matching element; it’s testing whether that matching element has any child elements! To test whether the `find()` method returned an element, use `if element.find('...') is not None`.
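
The warning above can be seen in a small sketch. It assumes a one-line inline entry and compares against `None` explicitly instead of relying on truthiness; the `ns` shortcut is a name made up for this example.

```python
import xml.etree.ElementTree as etree

# Made-up single entry with one childless <title>.
entry = etree.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    '<title>Dive into history</title>'
    '</entry>'
)
ns = '{http://www.w3.org/2005/Atom}'

title = entry.find(f'{ns}title')  # exists, but has no child elements
foo = entry.find(f'{ns}foo')      # does not exist

# The title was found, yet its length is 0, which is why a bare
# truth test on it would be misleading.
print(title is not None, len(title))  # True 0
print(foo is not None)                # False

# Reliable presence check:
if title is not None:
    print(title.text)
```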

### Search for descendant elements

Prefixing a query with `.//` tells `findall()` to search all descendants of the element, at any nesting level, rather than only its direct children.

In [25]:
all_links = tree.findall(f'.//{namespace}link')
all_links

[<Element '{http://www.w3.org/2005/Atom}link' at 0x1094f14f0>,
 <Element '{http://www.w3.org/2005/Atom}link' at 0x1094f1860>,
 <Element '{http://www.w3.org/2005/Atom}link' at 0x1094f1f90>,
 <Element '{http://www.w3.org/2005/Atom}link' at 0x1094f5400>]

In [26]:
all_links[0].attrib

{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/'}

In [27]:
all_links[1].attrib

{'rel': 'alternate',
 'type': 'text/html',
 'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'}

In [28]:
all_links[2].attrib

{'rel': 'alternate',
 'type': 'text/html',
 'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'}

In [29]:
all_links[3].attrib

{'rel': 'alternate',
 'type': 'text/html',
 'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'}
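
To make the difference between a direct-child query and a `.//` descendant query concrete, here is a minimal sketch on a made-up inline document (the `example.org` URLs and the `ns` shortcut are placeholders, not taken from the feed).

```python
import xml.etree.ElementTree as etree

# Made-up miniature feed: one link at feed level, one inside an entry.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://example.org/"/>
  <entry>
    <author><name>Mark</name></author>
    <link href="http://example.org/first"/>
  </entry>
</feed>"""

root = etree.fromstring(SAMPLE)
ns = '{http://www.w3.org/2005/Atom}'

direct_links = root.findall(f'{ns}link')       # direct children only
all_links = root.findall(f'.//{ns}link')       # any depth
direct_authors = root.findall(f'{ns}author')   # nested, so not found
all_authors = root.findall(f'.//{ns}author')   # found via descendant search

hrefs = [link.get('href') for link in all_links]
print(len(direct_links), len(all_links))      # 1 2
print(len(direct_authors), len(all_authors))  # 0 1
print(hrefs)
```

Results come back in document order, so the feed-level link appears before the link nested inside the entry.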