## ElementTree

`ElementTree` from the standard library and `lxml` are the most prevalent tools in the Python world for processing XML.

ElementTree is part of the Python standard library, so there is nothing to install. Import it as follows.

In [1]:
import xml.etree.ElementTree as etree

The primary entry point for the ElementTree library is the `parse()` function, which can take a filename or a file-like object. This function parses the entire document at once. 
If memory is tight, there are ways to parse an XML document incrementally instead.
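One way to parse incrementally is the standard library's `iterparse()` function. The following is a minimal sketch, not a recipe: the inline `SAMPLE` document merely stands in for `feed.xml`, and the names `SAMPLE`, `ATOM` and `entry_titles` are made up for this illustration.

```python
import io
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

# A tiny stand-in for feed.xml (hypothetical data, not the real file).
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>dive into mark</title>
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>"""

entry_titles = []
# iterparse() yields (event, element) pairs while it reads the input;
# an 'end' event fires once an element has been read completely.
for event, elem in ET.iterparse(io.BytesIO(SAMPLE), events=('end',)):
    if elem.tag == f'{ATOM}entry':
        entry_titles.append(elem.findtext(f'{ATOM}title'))
        # Drop the entry's children once they have been processed.
        elem.clear()

print(entry_titles)
```

Clearing each entry after use is what keeps memory consumption low on large documents; without it, the tree would still grow to full size.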

In [2]:
tree = etree.parse('feed.xml')

The `parse()` function returns an object which represents the entire document. 
This is not the root element. 
To get a reference to the root element, call the `getroot()` method.

In [3]:
root = tree.getroot()

In the example file `feed.xml` the root element is the feed element in the http://www.w3.org/2005/Atom namespace. 
The string representation of this object reinforces an important point: 
an XML element is a combination of its namespace and its tag name (also called the local name). 
Every element in this document is in the Atom namespace, 
so the root element is represented as {http://www.w3.org/2005/Atom}feed.

In [4]:
root

<Element '{http://www.w3.org/2005/Atom}feed' at 0x10b8e7db0>

*ElementTree represents XML elements as `{namespace}localname`. You’ll see and use this format in multiple places in the ElementTree API.*

In [5]:
root.tag

'{http://www.w3.org/2005/Atom}feed'
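If you need the namespace and the local name separately, plain string handling is enough. The helper below is a hypothetical convenience function (`split_tag` is not part of ElementTree), shown only as a sketch.

```python
def split_tag(tag):
    """Split '{namespace}localname' into (namespace, localname)."""
    if tag.startswith('{'):
        namespace, _, localname = tag[1:].partition('}')
        return namespace, localname
    # Elements outside any namespace have a plain tag.
    return None, tag


print(split_tag('{http://www.w3.org/2005/Atom}feed'))
```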

The “length” of the root element is the number of child elements.

In [6]:
len(root)

8

An element is iterable: looping over it yields its child elements. Only direct children are included, not deeper descendants.

In [7]:
for child in root:
    print(child)

<Element '{http://www.w3.org/2005/Atom}title' at 0x10b8e7e50>
<Element '{http://www.w3.org/2005/Atom}subtitle' at 0x10b8e7ef0>
<Element '{http://www.w3.org/2005/Atom}id' at 0x10b8ed040>
<Element '{http://www.w3.org/2005/Atom}updated' at 0x10b8ed0e0>
<Element '{http://www.w3.org/2005/Atom}link' at 0x10b8ed220>
<Element '{http://www.w3.org/2005/Atom}entry' at 0x10b8ed2c0>
<Element '{http://www.w3.org/2005/Atom}entry' at 0x10b8eda90>
<Element '{http://www.w3.org/2005/Atom}entry' at 0x10b8edf90>
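To visit nested elements as well, ElementTree offers the `iter()` method. The sketch below contrasts the two on a small inline document (invented for this example, without namespaces to keep the output short).

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<feed><title/><entry><title/><link/></entry></feed>'
)

# Iterating over an element yields its direct children only.
direct = [child.tag for child in root]
# iter() walks the element itself and all of its descendants.
everything = [el.tag for el in root.iter()]

print(direct)
print(everything)
```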


### Attributes Are Dictionaries

Once you have a reference to a specific element, you can easily get its attributes as a Python dictionary.

In [8]:
root.attrib

{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}

In [9]:
root[4]

<Element '{http://www.w3.org/2005/Atom}link' at 0x10b8ed220>

In [10]:
root[4].attrib

{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/'}

In [11]:
root[3]

<Element '{http://www.w3.org/2005/Atom}updated' at 0x10b8ed0e0>

In [12]:
# The updated element has no attributes, 
# so its .attrib is just an empty dictionary.
root[3].attrib

{}
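Besides the `attrib` dictionary, elements have a `get()` method that mirrors `dict.get()`, which is handy when an attribute may be missing. A small sketch on an inline `link` element (the sample markup and variable names are made up for illustration):

```python
import xml.etree.ElementTree as ET

link = ET.fromstring('<link rel="alternate" href="http://diveintomark.org/"/>')

href = link.get('href')            # same value as link.attrib['href']
hreflang = link.get('hreflang')    # missing attribute: None, no KeyError
media = link.get('media', 'all')   # missing attribute with a fallback value
attr_names = sorted(link.keys())   # attribute names only

print(href, hreflang, media, attr_names)
```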

## Searching For Nodes Within An XML Document

### findall()

Each element — including the root element, but also child elements — has a `findall()` method. It finds all matching elements among the element’s children.

In [13]:
tree

<xml.etree.ElementTree.ElementTree at 0x10b8eb0a0>

In [14]:
# We will need to use the namespace a lot, so we make this shortcut
namespace = '{http://www.w3.org/2005/Atom}'

In [15]:
root.findall(f'{namespace}entry')

[<Element '{http://www.w3.org/2005/Atom}entry' at 0x10b8ed2c0>,
 <Element '{http://www.w3.org/2005/Atom}entry' at 0x10b8eda90>,
 <Element '{http://www.w3.org/2005/Atom}entry' at 0x10b8edf90>]

In [16]:
root.tag

'{http://www.w3.org/2005/Atom}feed'

In [17]:
# This query returns an empty list because the root
# element 'feed' does not have any child element 'feed'
root.findall(f'{namespace}feed')

[]

In [18]:
# This query only finds direct children. The author nodes are nested,
# therefore this query returns an empty list
root.findall(f'{namespace}author')

[]

For convenience, the `tree` object (returned from the `etree.parse()` function) has several methods that mirror the methods on the root element. The results are the same as if you had called the `tree.getroot().findall()` method.

In [19]:
tree.findall(f'{namespace}entry')

[<Element '{http://www.w3.org/2005/Atom}entry' at 0x10b8ed2c0>,
 <Element '{http://www.w3.org/2005/Atom}entry' at 0x10b8eda90>,
 <Element '{http://www.w3.org/2005/Atom}entry' at 0x10b8edf90>]

In [20]:
tree.findall(f'{namespace}author')

[]
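As an alternative to the f-string shortcut, `findall()` and `find()` also accept a prefix-to-URI mapping as a second argument. The sketch below uses an inline stand-in for the feed; the prefix `atom` and the names `NAMESPACES` and `authors_direct` are arbitrary choices for this example.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {'atom': 'http://www.w3.org/2005/Atom'}  # the prefix is our choice

root = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><author><name>Mark</name></author></entry>'
    '<entry/>'
    '</feed>'
)

entries = root.findall('atom:entry', NAMESPACES)
# author is nested inside entry, so a direct-child query finds nothing.
authors_direct = root.findall('atom:author', NAMESPACES)

print(len(entries), authors_direct)
```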

### find()

The `find()` method takes an ElementTree query and returns the first matching element. This is useful when you expect only one match, or when there are multiple matches and you only care about the first one.

In [21]:
entries = tree.findall(f'{namespace}entry')
len(entries)

3

In [22]:
# Get the first title; we happen to know there is only one title
title_element = entries[0].find(f'{namespace}title')
title_element.text

'Dive into history, 2009 edition'

There are no elements in this entry named `foo`, so this returns `None`.

In [23]:
foo_element = entries[0].find(f'{namespace}foo')
foo_element

In [24]:
type(foo_element)

NoneType

**Beware:** In a boolean context, ElementTree element objects will evaluate to `False` if they contain no children (i.e. if `len(element)` is 0). This means that `if element.find('...')` is not testing whether the `find()` method found a matching element; it’s testing whether that matching element has any child elements! To test whether the `find()` method returned an element, use `if element.find('...') is not None`.
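The pitfall can be sketched as follows. The inline `entry` document and the variable names are made up for this example; the point is that a found element can still have a length of zero.

```python
import xml.etree.ElementTree as ET

entry = ET.fromstring('<entry><title>Dive into history</title></entry>')

title = entry.find('title')    # found, but it has no child elements
missing = entry.find('foo')    # not found, so this is None

# The explicit comparison distinguishes the two cases reliably.
found_title = title is not None
found_missing = missing is not None
# This zero is why truth-testing a found element is misleading.
title_child_count = len(title)

print(found_title, found_missing, title_child_count)
```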

### Search for descendant elements

A query like `.//{http://www.w3.org/2005/Atom}link`, with `.//` at the beginning, finds all matching elements below the starting point, regardless of nesting level.

In [25]:
all_links = tree.findall(f'.//{namespace}link')
all_links

[<Element '{http://www.w3.org/2005/Atom}link' at 0x10b8ed220>,
 <Element '{http://www.w3.org/2005/Atom}link' at 0x10b8ed590>,
 <Element '{http://www.w3.org/2005/Atom}link' at 0x10b8edcc0>,
 <Element '{http://www.w3.org/2005/Atom}link' at 0x10b8ef130>]

In [26]:
all_links[0].attrib

{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/'}

In [27]:
all_links[1].attrib

{'rel': 'alternate',
 'type': 'text/html',
 'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'}

In [28]:
all_links[2].attrib

{'rel': 'alternate',
 'type': 'text/html',
 'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'}

In [29]:
all_links[3].attrib

{'rel': 'alternate',
 'type': 'text/html',
 'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'}

ElementTree’s `findall()` method is a very powerful feature, but the query language can be a bit surprising. ElementTree’s query language is similar enough to XPath to do basic searching, but dissimilar enough that it may annoy you if you already know XPath.
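To give a feeling for what the built-in query language does handle, here is a sketch with an attribute test, an attribute-value test and a position test. The inline feed and its `example.org` URLs are invented for this illustration.

```python
import xml.etree.ElementTree as ET

NS = '{http://www.w3.org/2005/Atom}'
root = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<link rel="alternate" href="http://example.org/"/>'
    '<entry><link rel="self" href="http://example.org/1"/></entry>'
    '<entry><link rel="alternate" href="http://example.org/2"/></entry>'
    '</feed>'
)

# [@href]: any link, at any depth, that has an href attribute
with_href = root.findall(f'.//{NS}link[@href]')
# [@rel='alternate']: the attribute must have exactly this value
alternates = root.findall(f".//{NS}link[@rel='alternate']")
# [2]: the second entry child (positions start at 1)
second_entry = root.findall(f'{NS}entry[2]')

print(len(with_href), len(alternates), len(second_entry))
```

Queries that go beyond this subset, XPath functions for instance, are where a full XPath implementation becomes necessary.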

## Parsing with lxml

`lxml` is built on the popular `libxml2` parser. It provides an API that is largely compatible with ElementTree, then extends it with full XPath 1.0 support and a few other niceties.

In [30]:
# We will need to use the namespace a lot, so we make this shortcut
namespace = '{http://www.w3.org/2005/Atom}'

In [31]:
from lxml import etree
tree = etree.parse('feed.xml')
root = tree.getroot()
root.findall(f'{namespace}entry')

[<Element {http://www.w3.org/2005/Atom}entry at 0x10bf9ab80>,
 <Element {http://www.w3.org/2005/Atom}entry at 0x10bf9abc0>,
 <Element {http://www.w3.org/2005/Atom}entry at 0x10bf9ac00>]

For large XML documents `lxml` is significantly faster than the built-in ElementTree library. If you’re only using the ElementTree API and want to use the fastest available implementation, you can try to import `lxml` and fall back to the built-in ElementTree.

In [32]:
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

The following query finds all elements in the Atom namespace, anywhere in the document, that have an `href` attribute. The `//` at the beginning of the query means “elements anywhere (not just as children of the root element).” `{http://www.w3.org/2005/Atom}` means “only elements in the Atom namespace.” `*` means “elements with any local name.” And `[@href]` means “has an href attribute.”

In [33]:
tree.findall(f'//{namespace}*[@href]')

[<Element {http://www.w3.org/2005/Atom}link at 0x10bfa6280>,
 <Element {http://www.w3.org/2005/Atom}link at 0x10bfa62c0>,
 <Element {http://www.w3.org/2005/Atom}link at 0x10bfa6300>,
 <Element {http://www.w3.org/2005/Atom}link at 0x10bfa6340>]

In [34]:
tree.findall(f"//{namespace}*[@href='http://diveintomark.org/']")

[<Element {http://www.w3.org/2005/Atom}link at 0x10bfa6280>]

In [35]:
# A shorter name for the namespace variable keeps the queries readable
NS = '{http://www.w3.org/2005/Atom}'

The following query searches for Atom `author` elements that have an Atom `uri` element as a child. This only returns two `author` elements, the ones in the first and second `entry`. The `author` in the last `entry` contains only a `name`, not a `uri`. 

In [36]:
tree.findall(f'//{NS}author[{NS}uri]')

[<Element {http://www.w3.org/2005/Atom}author at 0x10bfa6600>,
 <Element {http://www.w3.org/2005/Atom}author at 0x10bfa6840>]

### XPath support in lxml


Technically, an XPath expression returns a list of nodes (that is what the DOM of a parsed XML document is made up of). Depending on their type, nodes can be elements, attributes, or even text content.

To perform XPath queries on namespaced elements, you need to define a namespace prefix mapping. This is just a Python dictionary.

In [37]:
NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}

The following XPath expression searches for `category` elements (in the Atom namespace) that contain a `term` attribute with the value `accessibility`. The `/..` bit means to return the parent element of the category element you just found.
So this single XPath query will find all entries with a child element of `<category term='accessibility'>`.
In this case the `xpath()` method returns a list of `Element` objects.

In [38]:
entries = tree.xpath("//atom:category[@term='accessibility']/..", 
                    namespaces=NSMAP)
entries

[<Element {http://www.w3.org/2005/Atom}entry at 0x10bf9abc0>]

The following query returns a list that contains a string. It selects text content (`text()`) of the title element (`atom:title`) that is a child of the current element (`./`).

In [39]:
# Pick the first (and only) element from the entries list
entry = entries[0]
# It is an lxml Element and therefore supports xpath() as well
entry.xpath('./atom:title/text()', namespaces=NSMAP)

['Accessibility is a harsh mistress']
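If `lxml` is not available, these two particular queries can be approximated with the built-in ElementTree, which understands the `..` step and offers `findtext()` in place of `text()`. This is only a rough stand-in for real XPath, sketched on an inline document invented for this example.

```python
import xml.etree.ElementTree as ET

NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}
root = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><title>Dive into history</title>'
    '<category term="diveintopython"/></entry>'
    '<entry><title>Accessibility is a harsh mistress</title>'
    '<category term="accessibility"/></entry>'
    '</feed>'
)

# Find the matching category elements, then step up to their parents.
entries = root.findall(".//atom:category[@term='accessibility']/..", NSMAP)
# findtext() returns the text of the first matching child element.
titles = [entry.findtext('atom:title', namespaces=NSMAP) for entry in entries]

print(titles)
```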

## Generating XML

You can create XML documents from scratch.

In [40]:
import xml.etree.ElementTree as etree
atom_NS = '{http://www.w3.org/2005/Atom}'
w3_NS = '{http://www.w3.org/XML/1998/namespace}'

To create a new element, instantiate the `Element` class. You pass the element name (namespace + local name) as the first argument. This statement creates a `feed` element in the Atom namespace. This will be our new document’s root element.

To add attributes to the newly created element, pass a dictionary of attribute names and values in the `attrib` argument. Note that the attribute name should be in the standard ElementTree format, `{namespace}localname`.

In [41]:
new_feed = etree.Element(f'{atom_NS}feed',
                        attrib={f'{w3_NS}lang': 'en'})

At any time, you can serialize any element (and its children) with the ElementTree `tostring()` function.

In [42]:
print(etree.tostring(new_feed))

b'<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom" xml:lang="en" />'


### Default namespaces

A default namespace is useful for documents — like Atom feeds — where every element is in the same namespace. The namespace is declared once and each element just needs to be declared with its local name (`<feed>`, `<link>`, `<entry>`). There is no need to use any prefixes unless you want to declare elements from another namespace.
    
The first snippet has a default, implicit namespace.

```xml
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>
```

The second snippet, which is how `ElementTree` serializes namespaced XML elements, declares an explicit prefix. This is technically accurate, but a bit cumbersome to work with.

Both serializations in the example produce an identical DOM.

```xml
<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/>
```
    
`lxml` offers fine-grained control over how namespaced elements are serialized. The built-in `ElementTree` offers far less.
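For completeness, the built-in library does have one coarse option: since Python 3.8, `tostring()` accepts a `default_namespace` argument that applies to the whole serialization rather than to individual elements. A minimal sketch (the names `ATOM_URI`, `XML_URI`, `prefixed` and `unprefixed` are chosen for this example):

```python
import xml.etree.ElementTree as ET

ATOM_URI = 'http://www.w3.org/2005/Atom'
XML_URI = 'http://www.w3.org/XML/1998/namespace'

# Triple braces in an f-string produce a literal brace around the URI.
feed = ET.Element(f'{{{ATOM_URI}}}feed', attrib={f'{{{XML_URI}}}lang': 'en'})

# Without help, ElementTree invents a prefix such as ns0.
prefixed = ET.tostring(feed, encoding='unicode')
# default_namespace declares the URI as the default namespace instead.
unprefixed = ET.tostring(feed, encoding='unicode', default_namespace=ATOM_URI)

print(prefixed)
print(unprefixed)
```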

In [43]:
# We import lxml's etree under its full name to make it recognizable
# in the example
import lxml.etree

Define a namespace mapping as a dictionary. Dictionary values are namespaces; dictionary keys are the desired prefix. Using `None` as a prefix effectively declares a default namespace.

In [44]:
NSMAP = {None: 'http://www.w3.org/2005/Atom'}

Now you can pass the `lxml`-specific `nsmap` argument when you create an element, and `lxml` will respect the namespace prefixes you’ve defined.

In [45]:
new_feed = lxml.etree.Element('feed', nsmap=NSMAP)

This serialization defines the Atom namespace as the default namespace and declares the feed element without a namespace prefix.

In [46]:
print(lxml.etree.tounicode(new_feed))

<feed xmlns="http://www.w3.org/2005/Atom"/>


In [47]:
# tostring() returns a bytes object by default; tounicode() is one way
# to get a str instead (tostring(..., encoding='unicode') is another)
print(lxml.etree.tostring(new_feed))

b'<feed xmlns="http://www.w3.org/2005/Atom"/>'


You can always add attributes to any element with the `set()` method. It takes two arguments: the attribute name in standard ElementTree format, then the attribute value. This method is not `lxml`-specific.

In [48]:
new_feed.set('{http://www.w3.org/XML/1998/namespace}lang', 'en')
print(lxml.etree.tounicode(new_feed))

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"/>


### Create child elements

Call the `SubElement()` function to create a child element of an existing element. The only required arguments are the parent element (`new_feed` in this case) and the new element’s name. Since this child element will inherit the namespace mapping of its parent, there is no need to redeclare the namespace or prefix here.

You can also pass in an attribute dictionary. Keys are attribute names; values are attribute values.

In [49]:
title = lxml.etree.SubElement(new_feed, 'title', 
                              attrib={'type':'html'})
print(lxml.etree.tounicode(new_feed))

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html"/></feed>


Set the `.text` property to add the text content to an element. 

In [52]:
title.text = 'dive into &hellip;'
print(lxml.etree.tounicode(new_feed))

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html">dive into &amp;hellip;</title></feed>


In [53]:
print(lxml.etree.tounicode(new_feed, pretty_print=True))

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title type="html">dive into &amp;hellip;</title>
</feed>
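The same steps work with the built-in ElementTree, which also makes the escaping behaviour easy to see: an ampersand in `.text` is always written as `&amp;`, so an entity reference typed into the string does not survive. The sketch below assumes Python 3.9 or later for `indent()`; the element names mirror the example above, without a namespace for brevity.

```python
import xml.etree.ElementTree as ET

feed = ET.Element('feed')
title = ET.SubElement(feed, 'title', attrib={'type': 'html'})

# The literal ampersand is escaped on output, so no entity is produced.
title.text = 'dive into &hellip;'
escaped = ET.tostring(feed, encoding='unicode')

# To emit an actual ellipsis, assign the character itself.
title.text = 'dive into \u2026'
ET.indent(feed, space='  ')   # in-place pretty-printing, Python 3.9+
pretty = ET.tostring(feed, encoding='unicode')

print(escaped)
print(pretty)
```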



## Parsing broken XML

`lxml` is capable of parsing XML documents that are not well-formed.

By default, the parser chokes on this document because the `&hellip;` entity is not defined in XML.

In [56]:
import lxml.etree
tree = lxml.etree.parse('broken-feed.xml')

XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28 (broken-feed.xml, line 3)

Instantiate the `lxml.etree.XMLParser` class to create a custom parser. It can take a number of different named arguments. Here we are using the `recover` argument, so that the XML parser will try its best to “recover” from well-formedness errors.

In [57]:
parser = lxml.etree.XMLParser(recover=True)

This works! The second argument of `parse()` is the custom parser.

In [59]:
tree = lxml.etree.parse('broken-feed.xml', parser)

The parser keeps a log of the well-formedness errors that it has encountered.

In [60]:
parser.error_log

broken-feed.xml:3:28:FATAL:PARSER:ERR_UNDECLARED_ENTITY: Entity 'hellip' not defined

In [61]:
tree.findall('{http://www.w3.org/2005/Atom}title')

[<Element {http://www.w3.org/2005/Atom}title at 0x10e022d80>]

The parser just dropped the undefined `&hellip;` entity. The text content of the title element becomes 'dive into '.

In [64]:
title = tree.findall('{http://www.w3.org/2005/Atom}title')[0]
title.text

'dive into '

As you can see from the serialization, the `&hellip;` entity didn’t get moved; it was simply dropped.

In [63]:
print(lxml.etree.tounicode(tree.getroot()))

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>dive into </title>
</feed>
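For comparison, the built-in ElementTree parser is strict. As far as its documented arguments go it has no counterpart to `recover`, so the best you can do is catch the error and report where it happened. A sketch with an inline stand-in for `broken-feed.xml` (the `BROKEN` string is invented for this example):

```python
import xml.etree.ElementTree as ET

BROKEN = (
    '<feed xmlns="http://www.w3.org/2005/Atom">\n'
    '  <title>dive into &hellip;</title>\n'
    '</feed>'
)

error_line = None
try:
    ET.fromstring(BROKEN)
except ET.ParseError as err:
    # position is a (line, column) tuple pointing at the offending input.
    error_line, error_column = err.position
    print(f'not well-formed at line {error_line}, column {error_column}')
```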
