# What is XPath?

## XPath is a language

> XPath is a language for addressing parts of an XML document 
> 
> -- <cite>[XML Path Language 1.0](https://www.w3.org/TR/xpath/)</cite>


The XPath [data model](http://www.w3.org/TR/xpath/#data-model) is a tree of nodes:
* element nodes (`<p>...</p>`)
* attribute nodes (`href="page.html"`)
* text nodes (`"Some Title"`)
* comment nodes (`<!-- a comment -->`)
* (and 3 other types that we won’t cover here.)
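
To make these node types concrete, here is a minimal sketch that walks a tiny snippet and labels each node. It uses the standard library's `xml.dom.minidom` rather than an XPath engine (an assumption made to keep the example dependency-free; the DOM is close to, but not identical to, the XPath data model), and `describe` is a hypothetical helper name:

```python
from xml.dom import minidom, Node

# Map DOM node types to the XPath node kinds listed above.
NODE_KINDS = {
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}

def describe(node, found=None):
    """Collect (kind, value) pairs for `node` and everything below it."""
    found = [] if found is None else found
    kind = NODE_KINDS.get(node.nodeType)
    if kind == "element":
        found.append(("element", node.tagName))
        # Attributes are not children: they hang off their element.
        for name, value in sorted(node.attributes.items()):
            found.append(("attribute", "%s=%r" % (name, value)))
    elif kind in ("text", "comment"):
        found.append((kind, node.data))
    for child in node.childNodes:
        describe(child, found)
    return found

snippet = '<p>Some Title <a href="page.html">link</a><!-- a comment --></p>'
nodes = describe(minidom.parseString(snippet).documentElement)
for kind, value in nodes:
    print(kind, repr(value))
```

Each item of the list above shows up once or more: two elements, one attribute, two text nodes and one comment.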

## Why learn XPath?

* navigate **everywhere** inside a DOM tree
* a must-have skill for accurate web data extraction
* more powerful than CSS selectors
* fine-grained look at the text content
* complex conditioning with axes
* extensible with custom functions (we won’t cover that in this talk though)

Also, it’s kind of fun :-)

## XPath data model

Let's use this sample HTML page to illustrate how XPath works:

```
<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>
</body>
</html>
```

In [1]:
import libxml2

In [2]:
htmlsample = '''<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type" />
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br />
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>
</body>
</html>'''

In [3]:
doc = libxml2.htmlReadDoc(htmlsample, 'http://www.example.com', 'utf-8', 0)

Everything is a node in the XPath data model: elements, attributes, comments...

Nodes also have an order, the **document order**: the order in which they appear in the XML/HTML source.

In [4]:
from collections import OrderedDict
from asciitree import LeftAligned

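order = 0  # running counter: number of nodes visited so far, in document order
def traverse(node, accept=('text', 'attribute', 'comment'), ignore_empty_text=True):
    """Return a nested {label: children} dict for `node`, labelling nodes with their document order."""
    global order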
    children_list = OrderedDict()
    _name = "#%2d--<%s>" % (order, node.name)
    _type = node.get_type()
    if node.isText():
        if 'text' not in accept:
            return
        if ignore_empty_text and not node.content.strip():
            return
        _content = '#%2d--(TXT): %r' % (order, node.content)
    elif _type == 'attribute':
        if 'attribute' not in accept:
            return
        _content = '#%2d--(ATTR): %s: %r' % (order, node.name, node.content)
    elif _type == 'comment':
        if 'comment' not in accept:
            return
        _content = '#%2d--(COMM): %r' % (order, node.content)
    elif _type == 'document_html':
        _content = '#%2d--(ROOT)' % (order,)
    else:
        _content = _name
    
    for child in node.xpathEval('(child::node() | attribute::*)'):
        order += 1
        r = traverse(child, accept=accept, ignore_empty_text=ignore_empty_text)
        if r is not None:
            children_list.update(r)
            
    return {_content: children_list}


tr = LeftAligned()
print(tr(OrderedDict(traverse(doc, ignore_empty_text=False))))

# 0--(ROOT)
 +-- # 1--<html>
     +-- # 2--(TXT): '\n'
     +-- # 3--<head>
     |   +-- # 4--(TXT): '\n  '
     |   +-- # 5--<title>
     |   |   +-- # 6--(TXT): 'This is a title'
     |   +-- # 7--(TXT): '\n  '
     |   +-- # 8--<meta>
     |   |   +-- # 9--(ATTR): content: 'text/html; charset=utf-8'
     |   |   +-- #10--(ATTR): http-equiv: 'content-type'
     |   +-- #11--(TXT): '\n'
     +-- #12--(TXT): '\n'
     +-- #13--<body>
     |   +-- #14--(TXT): '\n  '
     |   +-- #15--<div>
     |   |   +-- #16--(TXT): '\n    '
     |   |   +-- #17--<div>
     |   |   |   +-- #18--(TXT): '\n      '
     |   |   |   +-- #19--<p>
     |   |   |   |   +-- #20--(TXT): 'This is a paragraph.'
     |   |   |   +-- #21--(TXT): '\n      '
     |   |   |   +-- #22--<p>
     |   |   |   |   +-- #23--(TXT): 'Is this '
     |   |   |   |   +-- #24--<a>
     |   |   |   |   |   +-- #25--(ATTR): href: 'page2.html'
     |   |   |   |   |   +-- #26--(TXT): 'a link'
     |   |   |   |   +-- #27--(TXT): '?

## XPath return types

When applied over a document, an XPath expression can return either:

* a node-set (most common case, and most often element nodes)
* a string
* a number (floating point)
* a boolean

**Note: When an XPath expression returns node-sets, you do get a set of nodes, even if there's only one node in the set.**
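
The four return types can be mimicked with plain Python values. The sketch below uses the standard library's `xml.etree.ElementTree`, which is an assumption made for portability: it supports only a subset of XPath and has no `count()` or `string()`, so the number and the boolean are computed in Python instead of inside the expression.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html><head><title>This is a title</title></head>"
    "<body><p>One</p><p>Two</p></body></html>"
)

node_set = root.findall("./body/p")         # list of elements, like a node-set
string = root.findtext("./head/title")      # str, like string(/html/head/title)
number = float(len(root.findall(".//p")))   # float, like count(//p)
boolean = number == 2                       # bool, like count(//p) = 2

# A single match still comes back as a collection, not as a bare node.
single = root.findall("./head/title")
print(type(single).__name__, len(single))
```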

The root node is a special node:

> The root node is the root of the tree. A root node does not occur except as the root of the tree. The element node for the document element is a child of the root node.

In [5]:
# XPath expression
#              |
#              v
doc.xpathEval('/')

[<xmlDoc (http://www.example.com) object at 0x7f7ad4130050>]

### Selecting elements (a.k.a "tags")


Children of the root node: the HTML element (a node-set)

In [6]:
# XPath expression
#               |
#             |<>|
doc.xpathEval('/*')

[<xmlNode (html) object at 0x7f7ac53f3a28>]

Get `<title>` elements:

In [7]:
#               XPath expression
#                       |
#             |<-------------->|
doc.xpathEval('/html/head/title')

[<xmlNode (title) object at 0x7f7ad4130998>]

Get the text nodes of `<title>` elements:

In [8]:
#                   XPath expression
#                          |
#             |<--------------------->|
doc.xpathEval('/html/head/title/text()')

[<xmlNode (text) object at 0x7f7ac5403dd0>]

Selecting all paragraph elements inside the HTML body:

In [9]:
#           XPath expression
#                  |
#             |<------->|
doc.xpathEval('//body//p')

[<xmlNode (p) object at 0x7f7ac5403f80>,
 <xmlNode (p) object at 0x7f7ac53fb0e0>]

### Get a string representation of an element

In [10]:
#                    XPath expression
#                           |
#             |<---------------------->|
doc.xpathEval('string(/html/head/title)')

'This is a title'

### Counting elements

Number of paragraphs in the document:

In [11]:
#           XPath expression
#                   |
#             |<-------->|
doc.xpathEval('count(//p)')

2.0

Number of attributes in the document (whatever their parent element):

In [12]:
#           XPath expression
#                   |
#             |<-------->|
doc.xpathEval('count(//@*)')

5.0

### Boolean operations

For example, testing the number of paragraphs:

In [13]:
doc.xpathEval('count(//p) = 2')

1

In [14]:
doc.xpathEval('count(//p) = 42')

0

# Location Paths: how to move inside the document tree

A **location path** is the most common kind of XPath expression.

It is used to move in any direction from a starting point (*the context node*) to any node(s) in the tree:

* It is a string, with a series of “steps”: `"step1 / step2 / step3 ..."`

* It represents the selection and filtering of nodes, processed step by step, from left to right.

* Each step is of the form: `AXIS :: NODETEST [PREDICATE]*`
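
To illustrate the `AXIS :: NODETEST [PREDICATE]*` shape, here is a rough sketch that splits one step into its three parts with a regular expression. `split_step` is a hypothetical helper, not part of any XPath library, and it deliberately ignores abbreviations such as `@href` or `..` and predicates that themselves contain brackets:

```python
import re

# AXIS :: NODETEST [PREDICATE]*  (the axis is optional and defaults to "child")
STEP_RE = re.compile(
    r"^\s*(?:(?P<axis>[a-z-]+)\s*::)?"           # optional "axis::"
    r"\s*(?P<nodetest>[^\[\s]+)"                 # node test: a name, "*", "text()", ...
    r"\s*(?P<predicates>(?:\[[^\]]*\]\s*)*)$"    # zero or more "[...]"
)

def split_step(step):
    """Return (axis, nodetest, [predicates]) for a single location step."""
    match = STEP_RE.match(step)
    if match is None:
        raise ValueError("not a step this sketch understands: %r" % step)
    predicates = re.findall(r"\[([^\]]*)\]", match.group("predicates"))
    return (match.group("axis") or "child", match.group("nodetest"), predicates)

print(split_step('child::div[@class="second"][1]'))
print(split_step("descendant-or-self::node()"))
print(split_step("p"))
```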

Note: whitespace between tokens does NOT matter, except inside `//` (`/ /` is a syntax error).

So **don’t be afraid of indenting your XPath expressions**. The following 3 expressions produce the same result:


In [15]:
doc.xpathEval('/html/head/title')

[<xmlNode (title) object at 0x7f7ac53fb560>]

In [16]:
doc.xpathEval('/    html   / head   /title')

[<xmlNode (title) object at 0x7f7ac53fb5f0>]

In [17]:
doc.xpathEval('''
    /html
        /head
            /title''')

[<xmlNode (title) object at 0x7f7ac53fb8c0>]

## Relative vs. absolute paths

Location paths can be relative or absolute:

* `"step1/step2/step3"` is relative
* `"/step1/step2/step3"` is absolute

I.e. an absolute path is a relative path preceded by `/` (slash).

In fact, absolute paths are relative to the root node.

**Tip**: use relative paths whenever possible. This prevents unexpectedly selecting the same nodes in every loop iteration.

For example, in our sample document, only one `<div>` has paragraphs as children. Looping over each `<div>` and using the absolute location path `//p` produces the same result for each iteration: ALL paragraphs in the document are returned every time.


In [18]:
for div in doc.xpathEval('//body//div'):
    print(div.xpathEval('//p'))

[<xmlNode (p) object at 0x7f7ac53fb710>, <xmlNode (p) object at 0x7f7ac53fbcb0>]
[<xmlNode (p) object at 0x7f7ac53fbcb0>, <xmlNode (p) object at 0x7f7ac53fbd40>]
[<xmlNode (p) object at 0x7f7ac53fbd40>, <xmlNode (p) object at 0x7f7ac53fbd88>]


Compare this with the relative paths `p` or `./p`, which only look at `<p>` children of each `<div>`; only one of the three `<div>` elements turns out to have paragraphs:

In [19]:
for div in doc.xpathEval('//body//div'):
    print(div.xpathEval('p'))

[]
[<xmlNode (p) object at 0x7f7ac53fb998>, <xmlNode (p) object at 0x7f7ac53fbe18>]
[]


In [20]:
for div in doc.xpathEval('//body//div'):
    print(div.xpathEval('./p'))

[]
[<xmlNode (p) object at 0x7f7ac53fbb90>, <xmlNode (p) object at 0x7f7ac53fbe60>]
[]
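
A relative path can also reach deeper than direct children: `.//p` starts from the context node and looks at all of its descendants. Here is a minimal sketch of the difference, using the standard library's `xml.etree.ElementTree` (an assumption made to keep the example dependency-free; it implements only a subset of XPath) on a trimmed, well-formed copy of the sample body:

```python
import xml.etree.ElementTree as ET

body = ET.fromstring("""
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
    </div>
    <div class="second">Nothing to add.</div>
  </div>
</body>""")

# "p" only looks at children of each <div>; ".//p" looks at all its descendants.
children = [len(div.findall("p")) for div in body.iter("div")]
descendants = [len(div.findall(".//p")) for div in body.iter("div")]

print(children)     # [0, 2, 0]
print(descendants)  # [2, 2, 0]
```

Both paths stay anchored to the current `<div>`, so the second `<div class="second">` never reports paragraphs that belong to its sibling.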


## Abbreviated syntax

What we’ve seen earlier is in fact “abbreviated syntax”.

Full syntax is quite verbose:

| Abbreviated syntax           | Full syntax
|------------------------------|-------------------------------------------------
| `/html/head/title`           | `/child::html /child::head /child::title`
| `//meta/@content`            | `/descendant-or-self::node() /child::meta /attribute::content`
| `//div/div[@class="second"]` | `/descendant-or-self::node() /child::div /child::div [attribute::class = "second"]`
| `//div/a/text()`             | `/descendant-or-self::node() /child::div /child::a /child::text()`
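
The mapping in this table is mechanical enough to sketch in a few lines. The `expand` and `expand_step` helpers below are hypothetical and purely string-based: they assume there is no `/` or `[@` inside string literals or nested paths within predicates, so treat them as an illustration of the rules and not as a real XPath parser.

```python
import re

def expand_step(step):
    """Expand one abbreviated step, e.g. 'div[@class="x"]'."""
    name, predicates = re.match(r"([^\[]*)(.*)", step).groups()
    # "@name" at the start of a predicate is short for "attribute::name".
    predicates = predicates.replace("[@", "[attribute::")
    if name == ".":
        full = "self::node()"
    elif name == "..":
        full = "parent::node()"
    elif name.startswith("@"):
        full = "attribute::" + name[1:]
    elif "::" in name:
        full = name                  # already has an explicit axis
    else:
        full = "child::" + name      # "child" is the default axis
    return full + predicates

def expand(path):
    """Expand an abbreviated location path to its full syntax."""
    # "//" is short for "/descendant-or-self::node()/".
    path = path.replace("//", "/descendant-or-self::node()/")
    prefix = "/" if path.startswith("/") else ""
    return prefix + "/".join(expand_step(s) for s in path.split("/") if s)

for abbreviated in ("/html/head/title", "//meta/@content",
                    '//div/div[@class="second"]', "//div/a/text()"):
    print(abbreviated, "->", expand(abbreviated))
```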

## Axes: moving around

**AXIS** :: _nodetest [predicate]*_

**Axes give the direction to go next.**

* `self` (where you are)
* `parent`, `child` (direct hop)
* `ancestor`, `ancestor-or-self`, `descendant`, `descendant-or-self` (multi-hop)
* `following`, `following-sibling`, `preceding`, `preceding-sibling` (document order)
* `attribute`, `namespace` (non-element)

### Move up or down the tree

Let's assume that we have selected the first `<div>` element in our sample document:

In [23]:
first_div = doc.xpathEval('//body/div')[0]

The `self` axis represents *the context node*, i.e. where you are currently in the step. (More on when this can be useful later.)

In [24]:
first_div.xpathEval('self::*')

[<xmlNode (div) object at 0x7f7ac540e5f0>]

The `child` axis is for the immediate child nodes of the context node:

In [26]:
first_div.xpathEval('child::*')

[<xmlNode (div) object at 0x7f7ac540e638>,
 <xmlNode (div) object at 0x7f7ac540e098>]

In [32]:
order = 0
tr = LeftAligned()
print(tr(OrderedDict(traverse(doc, accept=('element',), ignore_empty_text=False))))

# 0--(ROOT)
 +-- # 1--<html>
     +-- # 3--<head>
     |   +-- # 5--<title>
     |   +-- # 8--<meta>
     +-- #13--<body>
         +-- #15--<div>
             +-- #17--<div>
             |   +-- #19--<p>
             |   +-- #22--<p>
             |   |   +-- #24--<a>
             |   +-- #29--<br>
             +-- #32--<div>
                 +-- #35--<a>


With this simplified tree representation (only considering elements, a.k.a tags), this is what `self`, `child` and `parent` select:

```
                # 0--(ROOT)
                 +-- # 1--<html>
                     +-- # 3--<head>
                     |   +-- # 5--<title>
                     |   +-- # 8--<meta>
parent::* ---------> +-- #13--<body>
                         |
self::* ------------->   +-- #15--<div>
                             |
child::* ---+----------->    +-- #17--<div>
            |                |   +-- #19--<p>
            |                |   +-- #22--<p>
            |                |   |   +-- #24--<a>
            |                |   +-- #29--<br>
            +----------->    +-- #32--<div>
                                 +-- #35--<a>
```

The `descendant` axis is similar to `child` but also goes deeper in the tree, looking at children of each child, recursively:

In [28]:
first_div.xpathEval('descendant::*')

[<xmlNode (div) object at 0x7f7ac540ee18>,
 <xmlNode (p) object at 0x7f7ac540e9e0>,
 <xmlNode (p) object at 0x7f7ac540ecf8>,
 <xmlNode (a) object at 0x7f7ac540ea28>,
 <xmlNode (br) object at 0x7f7ac540eb48>,
 <xmlNode (div) object at 0x7f7ac540e908>,
 <xmlNode (a) object at 0x7f7ac540e950>]

You might guess already what `parent` and `ancestor` axes are for: they're the dual axes of `child` and `descendant`:

In [27]:
first_div.xpathEval('parent::*')

[<xmlNode (body) object at 0x7f7ac540e830>]

In [29]:
first_div.xpathEval('ancestor::*')

[<xmlNode (html) object at 0x7f7ac53fb050>,
 <xmlNode (body) object at 0x7f7ad41e6fc8>]
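
To see why these four axes mirror each other, here is a rough sketch that emulates them by hand on the element-only tree shown earlier. It relies on the standard library's `xml.etree.ElementTree` (an assumption made for portability; it has no `ancestor` axis of its own), and `parent_of`, `ancestors` and `descendants` are hypothetical helper names:

```python
import xml.etree.ElementTree as ET

# Element-only skeleton of the sample page.
root = ET.fromstring(
    "<html><head><title/><meta/></head>"
    "<body><div>"
    "<div><p/><p><a/></p><br/></div>"
    "<div><a/></div>"
    "</div></body></html>"
)

# ElementTree elements do not know their parent, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(node):
    """Emulate ancestor::* by following parent links up to the top."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain[::-1]  # document order: outermost element first

def descendants(node):
    """Emulate descendant::* by walking down recursively, excluding the node itself."""
    return [n for n in node.iter() if n is not node]

first_div = root.find("./body/div")
print([n.tag for n in first_div])               # child::*
print([n.tag for n in descendants(first_div)])  # descendant::*
print(parent_of[first_div].tag)                 # parent::*
print([n.tag for n in ancestors(first_div)])    # ancestor::*
```

`ancestor` is just `parent` applied repeatedly, in the same way that `descendant` is `child` applied repeatedly.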