# Part 1: What is XPath?

## XPath is a language

> XPath is a language for addressing parts of an XML document

-- from [XML Path Language 1.0](https://www.w3.org/TR/xpath/)

This sentence from the abstract of the official specification says it all:

* _"XPath is a language"_: an XPath expression is a character string...
* _"for addressing parts of an XML document"_: ...a string that you pass to an XPath engine operating on an XML (or HTML) document. The engine outputs parts of that document, following the data model explained below.


## Why learn XPath?

* navigate **everywhere** inside a DOM tree
* a must-have skill for accurate web data extraction
* more powerful than CSS selectors
* fine-grained look at the text content
* complex conditioning with axes
* extensible with custom functions (we won’t cover that in this talk though)

## XPath data model

XPath's [data model](http://www.w3.org/TR/xpath/#data-model) is a tree of nodes representing a document.
Nodes can be:

* element nodes (`<p>...</p>`)
* attribute nodes (`href="page.html"`)
* text nodes (`"Some Title"`)
* comment nodes (`<!-- a comment -->`)
* (and 3 other types that we won’t cover here.)

In effect, this data model allows you to represent everything inside an XML or HTML document, in a structured and hierarchical way.

Throughout this tutorial, we'll use this sample HTML page to illustrate how XPath works:

```
<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>
</body>
</html>
```

In XPath's data model, everything is a node: elements, attributes, comments... (**but not all nodes are elements.**)

Nodes also have an order, the **document order**: the order in which they appear in the XML/HTML source.

Here is an ASCII tree representation of the HTML document for an XPath engine:

```
# 0--(ROOT)
 +-- # 1--<html>
     +-- # 2--(TXT): '\n'
     +-- # 3--<head>
     |   +-- # 4--(TXT): '\n  '
     |   +-- # 5--<title>
     |   |   +-- # 6--(TXT): 'This is a title'
     |   +-- # 7--(TXT): '\n  '
     |   +-- # 8--<meta>
     |   |   +-- # 9--(ATTR): content: 'text/html; charset=utf-8'
     |   |   +-- #10--(ATTR): http-equiv: 'content-type'
     |   +-- #11--(TXT): '\n'
     +-- #12--(TXT): '\n'
     +-- #13--<body>
     |   +-- #14--(TXT): '\n  '
     |   +-- #15--<div>
     |   |   +-- #16--(TXT): '\n    '
     |   |   +-- #17--<div>
     |   |   |   +-- #18--(TXT): '\n      '
     |   |   |   +-- #19--<p>
     |   |   |   |   +-- #20--(TXT): 'This is a paragraph.'
     |   |   |   +-- #21--(TXT): '\n      '
     |   |   |   +-- #22--<p>
     |   |   |   |   +-- #23--(TXT): 'Is this '
     |   |   |   |   +-- #24--<a>
     |   |   |   |   |   +-- #25--(ATTR): href: 'page2.html'
     |   |   |   |   |   +-- #26--(TXT): 'a link'
     |   |   |   |   +-- #27--(TXT): '?'
     |   |   |   +-- #28--(TXT): '\n      '
     |   |   |   +-- #29--<br>
     |   |   |   +-- #30--(TXT): '\n      Apparently.\n    '
     |   |   +-- #31--(TXT): '\n    '
     |   |   +-- #32--<div>
     |   |   |   +-- #33--(ATTR): class: 'second'
     |   |   |   +-- #34--(TXT): '\n      Nothing to add.\n      Except maybe this '
     |   |   |   +-- #35--<a>
     |   |   |   |   +-- #36--(ATTR): href: 'page3.html'
     |   |   |   |   +-- #37--(TXT): 'other link'
     |   |   |   +-- #38--(TXT): '. \n      '
     |   |   |   +-- #39--(COMM): ' And this comment '
     |   |   |   +-- #40--(TXT): '\n    '
     |   |   +-- #41--(TXT): '\n  '
     |   +-- #42--(TXT): '\n'
     +-- #43--(TXT): '\n'
 ```
 
You can see various tree branches and leaves:

* e.g. `<div>` or `<p>`: these are element nodes
* `(TXT)` represent text nodes
* `(ATTR)` represent attribute nodes
* `(COMM)` represent comment nodes

Note also that whitespace-only text nodes (spaces and newlines in our example) **are** proper nodes: they have their place in document order and can be selected with XPath.

## Using parsel to run XPath over a document

To illustrate and learn XPath, we will use the [parsel](https://github.com/scrapy/parsel) library. It is a Python module written on top of lxml.

Note: lxml itself is built using the C library libxml2, which has a conformant XPath 1.0 engine. You should be able to run the same XPath expressions with any XPath 1.0 engine, and get the same results.

This tutorial will only showcase XPath 1.0.

In [1]:
import parsel

Below is a small "hack" to change the representation of extracted nodes when using parsel. This is to represent return values as serialized HTML element or string, and not parsel's wrapper objects.

In [2]:
parsel.Selector.__str__ = parsel.Selector.extract
parsel.Selector.__repr__ = parsel.Selector.__str__
parsel.SelectorList.__repr__ = lambda x: '[{}]'.format(
    '\n '.join("({}) {!r}".format(i, repr(s))
               for i, s in enumerate(x, start=1))
).replace(r'\n', '\n')

To use XPath over our sample HTML document, we first need to create a `Selector`:

In [3]:
htmlsample = '''<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type" />
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br />
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>
</body>
</html>'''

doc = parsel.Selector(text=htmlsample)

## XPath return types

When applied over a document, an XPath expression can return either:

* a node-set -- this is the most common case, and often it's a set of element nodes
* a string
* a number (floating point)
* a boolean

**Note: When an XPath expression returns node-sets, you do get a set of nodes, even if there's only one node in the set.** With parsel, you get a list of nodes, not a Python set.


## XPath expressions

We will now take a look at some example XPath expressions to get a feeling for how they work. We'll explain the syntax in more detail later on.

XPath expressions are passed to an XPath engine as strings. Here we are using the `.xpath()` method of parsel `Selector` objects.

### Selecting the root node of a document

The root node is a special node:

> The root node is the root of the tree. A root node does not occur except as the root of the tree. The element node for the document element is a child of the root node.

Selecting the root node of a document with XPath is one of the shortest XPath expressions: `'/'` (a string with only a forward slash).

This is very similar to `cd /` (going to the root directory) in a shell within a Unix filesystem.

**Unfortunately, this does not work with parsel.** We get an empty list instead of the root node. It's a limitation of lxml apparently, because it works with libxml2 directly. In practice though, this doesn't matter, the root node is virtually never used directly.

In [4]:
# XPath expression
#          |
#          v
doc.xpath('/')

[]

### Selecting elements (a.k.a "tags")

Elements build the structure and hierarchy of the document. Selecting elements is probably the most common use-case for XPath on HTML documents.

Elements can have children -- the root node being the ancestor of them all. Their children can also have children and so on. Sometimes, they have only one child.

**Note:** text nodes are not elements. They are always children of some element and cannot have children of their own, so text nodes are always leaves of the document tree.

We said earlier that the document element is a child of the root node. In fact, it is the only element child of the root node: the top-level `<html>` element. Still, selecting it will return a single-node node-set:

In [5]:
# XPath expression
#           |
#           v 
doc.xpath('/*')

[(1) '<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>
</body>
</html>']

The asterisk here, `*`, means "any element". And `/*` means "any element under the root node". HTML documents have only 1 element like this: the `<html>` tag.

Another example: get `<title>` elements:

In [6]:
#           XPath expression
#                   |
#         |<-------------->|
doc.xpath('/html/head/title')

[(1) '<title>This is a title</title>']

Again, if you are familiar with the Unix filesystem, you probably intuitively understand what this does:
* start from the root (of the document)
  * select the `<html>` node
    * select the `<head>` node under the `<html>` node
      * select the `<title>` node under the `<head>` node

In other words, the XPath expression represents the path from the root node to the target node(s). Much like a Unix filepath represents the path from the filesystem root to the target file(s).

Selecting text nodes is a bit different: you use the special `text()` syntax.

To get the text nodes of `<title>` elements (remember that `<title>` is an element, and that it happens to contain a text node, with the string content "This is a title"):

In [7]:
#               XPath expression
#                      |
#         |<--------------------->|
doc.xpath('/html/head/title/text()')

[(1) 'This is a title']

Again, there's only one `<title>`, and it contains only one text node, but selecting text nodes in `<title>` returns a list holding a single string value, not a bare string.

Now, if we want to select all `<p>` paragraph elements inside the `<body>` (we expect two of them):

In [8]:
#       XPath expression
#              |
#         |<------->|
doc.xpath('//body//p')

[(1) '<p>This is a paragraph.</p>'
 (2) '<p>Is this <a href="page2.html">a link</a>?</p>']

Here, we are introducing the double-slash (`//`) syntax to do multi-hop lookups.

If you don't know (or don't care about) the level where the element that you need is located (relative to the root node), the special shortcut `//` tells the XPath engine to search recursively further down the tree, and not directly one-level deeper like in the earlier examples.


### Selecting attributes

Elements can also have attributes.

In our sample document, we have two `<a>` elements, each with a `href` attribute. There's also a `<meta>` element with two attributes: `content` and `http-equiv`.

This is how you can select these attributes, with an `@` prefix before the attribute name:

In [9]:
doc.xpath('//a/@href')

[(1) 'page2.html'
 (2) 'page3.html']

In [10]:
doc.xpath('//meta/@*')

[(1) 'text/html; charset=utf-8'
 (2) 'content-type']

The `*` (asterisk) after `@` means the same thing as in `/*`, except that it applies to attributes instead of elements: you want any attribute, whatever its name.

### Get a string representation of an element

In [11]:
#                XPath expression
#                       |
#         |<---------------------->|
doc.xpath('string(/html/head/title)')

[(1) 'This is a title']

This example uses one of several handy string functions in XPath. `string()` will concatenate all text content from the selected node and all of its children, recursively, effectively stripping HTML tags.

What happens when you apply `string()` on the document `<body>`? You get a text representation of the document, without the tags:

In [12]:
#         XPath expression
#                 |
#         |<------------>|
doc.xpath('string(//body)')

[(1) '
  
    
      This is a paragraph.
      Is this a link?
      
      Apparently.
    
    
      Nothing to add.
      Except maybe this other link. 
      
    
  
']

### Counting elements

We said earlier that XPath expressions could also return numbers.

One example of this is counting the number of paragraphs in the document:

In [13]:
#       XPath expression
#               |
#         |<-------->|
doc.xpath('count(//p)')

[(1) '2.0']

Note that you get a floating point number back, and in the form of a string. This is specific to parsel. Another XPath engine might return a native floating point number.

Another example: get the number of attributes in the document (whatever their parent element):

In [14]:
#       XPath expression
#               |
#         |<-------->|
doc.xpath('count(//@*)')

[(1) '5.0']

### Boolean operations

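XPath expressions can also return booleans, which parsel represents as the strings `'1'` (true) and `'0'` (false). For example, testing the number of paragraphs: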

In [15]:
doc.xpath('count(//p) = 2')

[(1) '1']

In [16]:
doc.xpath('count(//p) = 42')

[(1) '0']

# Part 2: Location Paths: how to move inside the document tree

A **Location path** is the most common XPath expression.

It is used to move in any direction from a starting point (*the context node*) to any node(s) in the tree:

* It is a string, with a series of “location steps”: `"step1 / step2 / step3 ..."`
* It represents the selection and filtering of nodes, processed step by step, from left to right.
* Each step is of the form: `AXIS :: NODETEST [PREDICATE]*`

So the examples we saw earlier are, or contain, location paths: `/html/head/title`, `//body//p` etc.

**Note:** whitespace does NOT matter, except inside `//` and `..` (`/   /` and `.  .` are syntax errors). The following 3 expressions produce the same result:


In [17]:
doc.xpath('/html/head/title')

[(1) '<title>This is a title</title>']

In [18]:
doc.xpath('/    html   / head   /title')

[(1) '<title>This is a title</title>']

In [19]:
doc.xpath('''
    /html
        /head
            /title''')

[(1) '<title>This is a title</title>']

So **don’t be afraid of indenting your XPath expressions to improve readability.**

## Relative vs. absolute paths

Location paths can be relative or absolute:

* `"step1/step2/step3"` is relative
* `"/step1/step2/step3"` is absolute

i.e. an absolute path is a relative path starting with "/" (slash).

In other words, absolute paths are relative to the root node.

**Tip**: use relative paths whenever possible. This prevents unexpectedly selecting the same nodes in every loop iteration.

For example, in our sample document, only one `<div>` has paragraphs as direct children. Looping over each `<div>` and using the absolute location path `//p` produces the same result at each iteration: ALL paragraphs in the document are returned every time.


In [20]:
for div in doc.xpath('//body//div'):
    print(div.xpath('//p'))

[(1) '<p>This is a paragraph.</p>'
 (2) '<p>Is this <a href="page2.html">a link</a>?</p>']
[(1) '<p>This is a paragraph.</p>'
 (2) '<p>Is this <a href="page2.html">a link</a>?</p>']
[(1) '<p>This is a paragraph.</p>'
 (2) '<p>Is this <a href="page2.html">a link</a>?</p>']


Compare this with the relative expressions `'p'` or `'./p'`, which only look at `<p>` children of each `<div>`. Only one of the three `<div>` elements has paragraph children, as shown below:

In [21]:
for div in doc.xpath('//body//div'):
    print(div.xpath('p'))

[]
[(1) '<p>This is a paragraph.</p>'
 (2) '<p>Is this <a href="page2.html">a link</a>?</p>']
[]


In [22]:
for div in doc.xpath('//body//div'):
    print(div.xpath('./p'))

[]
[(1) '<p>This is a paragraph.</p>'
 (2) '<p>Is this <a href="page2.html">a link</a>?</p>']
[]


## Abbreviated syntax

What we’ve seen earlier is in fact the “abbreviated syntax” for XPath expressions.

The full syntax is quite verbose (but you sometimes need it):

| Abbreviated syntax           | Full syntax
|------------------------------|-------------------------------------------------
| `/html/head/title`           | `/child::html /child::head /child::title`
| `//meta/@content`            | `/descendant-or-self::node() /child::meta /attribute::content`
| `//div/div[@class="second"]` | `/descendant-or-self::node() /child::div /child::div [attribute::class = "second"]`
| `//div/a/text()`             | `/descendant-or-self::node() /child::div /child::a /child::text()`

What are these `child::`, `descendant-or-self::` and `attribute::`, you may ask? They're axes.

## Axes: moving around

Remember: each step of an XPath location path is of the form `AXIS :: NODETEST [PREDICATE]*`.

The "axis" is the first part of each location path step. It can be explicit, or implicit in abbreviated syntax. For example, in `/html/head/title`, the `child::` axis is omitted in each step.

In this section, we'll use explicit axes as much as we can.

**AXIS** :: _nodetest [predicate]*_

**Axes give the direction to go next, one location step at a time**

* `self` (where you are)
* `parent`, `child` (direct hop)
* `ancestor`, `ancestor-or-self`, `descendant`, `descendant-or-self` (multi-hop)
* `following`, `following-sibling`, `preceding`, `preceding-sibling` (document order)
* `attribute`, `namespace` (non-element)

### Move up or down the tree: self, child, descendant, parent, ancestor

Let's assume that we have selected the first `<div>` element in our sample document, the one just under the `<body>` element:

In [23]:
first_div = doc.xpath('//body/div')[0]
first_div

<div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>

The `self` axis represents *the context node*, i.e. where you currently are in the location path. (This may not sound very useful, but we will see later when it can be handy.)

In [24]:
first_div.xpath('self::*')

[(1) '<div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>']

If you chain `self::` steps, you'll stay on the same context node:

In [25]:
first_div.xpath('self::*/self::*/self::*')

[(1) '<div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>']

`self::node()` is usually written in its abbreviated form, a `.` (dot). So you could also use:

In [26]:
first_div.xpath('.')

[(1) '<div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>']

In [27]:
first_div.xpath('././.')

[(1) '<div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>']

The `child` axis is for immediate children nodes of the context node. Here, our context `<div>` node has 2 `<div>` children:

In [28]:
first_div.xpath('child::*')

[(1) '<div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>'
 (2) '<div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>']

`child` is in fact the default axis, hence it can be omitted (e.g. we saw that `/html/head/title` is equivalent to `/child::html/child::head/child::title`.)

The `parent` axis is the dual of `child`: you go up one level in the DOM:

In [29]:
first_div.xpath('parent::*')

[(1) '<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>
</body>']

There's an alias for `parent::`: it's `..` (2 dots, much like in a Unix filesystem):

In [30]:
first_div.xpath('..')

[(1) '<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>
</body>']

Let's simplify our ASCII tree representation from earlier to only consider element nodes:

```
# 0--(ROOT)
 +-- # 1--<html>
     +-- # 3--<head>
     |   +-- # 5--<title>
     |   +-- # 8--<meta>
     +-- #13--<body>
         +-- #15--<div>
             +-- #17--<div>
             |   +-- #19--<p>
             |   +-- #22--<p>
             |   |   +-- #24--<a>
             |   +-- #29--<br>
             +-- #32--<div>
                 +-- #35--<a>
```

With this simplified tree representation, this is what `self`, `child` and `parent` select:

```
                # 0--(ROOT)
                 +-- # 1--<html>
                     +-- # 3--<head>
                     |   +-- # 5--<title>
                     |   +-- # 8--<meta>
parent::* ---------> +-- #13--<body>
                         |
self::* ------------->   +-- #15--<div>
                             |
child::*----+----------->    +-- #17--<div>
            |                |   +-- #19--<p>
            |                |   +-- #22--<p>
            |                |   |   +-- #24--<a>
            |                |   +-- #29--<br>
            +----------->    +-- #32--<div>
                                 +-- #35--<a>
```

#### Recursively go up or down

The `descendant` axis is similar to `child` but also goes deeper in the tree, looking at children of each child, recursively:

In [31]:
first_div.xpath('descendant::*')

[(1) '<div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>'
 (2) '<p>This is a paragraph.</p>'
 (3) '<p>Is this <a href="page2.html">a link</a>?</p>'
 (4) '<a href="page2.html">a link</a>'
 (5) '<br>'
 (6) '<div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>'
 (7) '<a href="page3.html">other link</a>']

You might guess already what `ancestor` is for: it is the dual axis of `descendant`:

In [32]:
first_div.xpath('ancestor::*')

[(1) '<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>
</body>
</html>'
 (2) '<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>
</body>']

#### Special case of `descendant-or-self` axis

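`descendant-or-self` selects the context node itself plus all of its descendants. It is the axis hidden behind the `//` shortcut, which stands for `/descendant-or-self::node()/`. The expression below is therefore the long form of `.//text()`: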

In [33]:
first_div.xpath('./descendant-or-self::node()/text()')

[(1) '
    '
 (2) '
      '
 (3) 'This is a paragraph.'
 (4) '
      '
 (5) 'Is this '
 (6) 'a link'
 (7) '?'
 (8) '
      '
 (9) '
      Apparently.
    '
 (10) '
    '
 (11) '
      Nothing to add.
      Except maybe this '
 (12) 'other link'
 (13) '. 
      '
 (14) '
    '
 (15) '
  ']

### Move "sideways": children nodes of the same parent
 
If nodes can have parents, children, ancestors and descendants, they can also have siblings (to continue the family metaphor). **Siblings are nodes that have the same parent node.**
 
Some siblings come before the context node (they appear earlier in the document, so their document order is lower), others come after it. There are 2 axes for these 2 directions: `preceding-sibling` and `following-sibling`.

Let's first select this paragraph from our sample document: `<p>Is this <a href="page2.html">a link</a>?</p>`. It's the 2nd `<p>` child of the 1st `<div>` inside the `<div>` we used above:

In [34]:
paragraph = first_div.xpath('child::div[1]/child::p[2]')[0]

You can notice above that we started using 2 new patterns along with the axes:

- `child::div` vs. `child::*`: `*` means "any element node" (this is a _NODETEST_ that we'll cover afterwards)
- `[1]` and `[2]`: which mean _first_ and _second_ in the current step's node-set (this is a kind of _PREDICATE_ that we'll cover afterwards also)

In [35]:
paragraph.xpath('preceding-sibling::*')

[(1) '<p>This is a paragraph.</p>']

In [36]:
paragraph.xpath('following-sibling::*')

[(1) '<br>']

Again, let's see which elements were selected in our ASCII tree representation:

```
                # 0--(ROOT)
                 +-- # 1--<html>
                     +-- # 3--<head>
                     |   +-- # 5--<title>
                     |   +-- # 8--<meta>
                     +-- #13--<body>
                         |
                         +-- #15--<div>
                             |
                             +-- #17--<div>
                             |   |
                             |   |
preceding-sibling::* ----------> +-- #19--<p>
                             |   |
                             |   |
self::* -----------------------> +-- #22--<p>
                             |   |   |
                             |   |   +-- #24--<a>
                             |   |
                             |   |
following-sibling::* ----------> +-- #29--<br>
                             |
                             |
                             +-- #32--<div>
                                 +-- #35--<a>
```

With the `node()` node test instead of `*`, text nodes among the following siblings are selected too:

In [37]:
paragraph.xpath('following-sibling::node()')

[(1) '
      '
 (2) '<br>'
 (3) '
      Apparently.
    ']

#### Nodes before and after, in document order

`preceding` and `following` are 2 special axes that do not look at the tree hierarchy, but work on the document order of nodes.

Remember, all nodes in XPath data model have an order, called the _document order_. Node 1 is the first node in the HTML source, node 2 is the node appearing next etc.

```
  #1    #2    #3   ...
<html><head><title>...
```

In [38]:
paragraph.xpath('preceding::*')

[(1) '<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>'
 (2) '<title>This is a title</title>'
 (3) '<meta content="text/html; charset=utf-8" http-equiv="content-type">'
 (4) '<p>This is a paragraph.</p>']

In [39]:
paragraph.xpath('following::*')

[(1) '<br>'
 (2) '<div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>'
 (3) '<a href="page3.html">other link</a>']

Note that `preceding` does not include ancestors and `following` does not include descendants.

This property is mentioned in XPath specs like this:

> The ancestor, descendant, following, preceding and self axes partition a document (ignoring attribute and namespace nodes): they do not overlap and together they contain all the nodes in the document.

i.e. `document == self U (ancestor U preceding) U (descendant U following)`

## Node tests

A "node test" is the second part of each step in a location path.

_axis_ :: **NODETEST** _[predicate]*_

Node tests select node types along the step's axis.

They can be:

* a *name test*:

  * such as "p", "title" or "a" for elements: `/html/head/title` contains 3 steps, each using a name test
  * or "href" or "src" for attributes: `//a/@href` selects the "href" attributes of `<a>` elements
  * "*" (an asterisk) matches any name; which nodes it selects depends on the axis:
    * an "*" step alone selects any element node (a.k.a. tag)
    * an "@*" step selects any attribute node

* a *node type test*:

  * "node()": any node type
  * "text()": text nodes
  * "comment()": comment nodes

**Note:** `text()` is not a function call that converts a node to its text representation, it's just a test on the node type.

Compare these 2 expressions:

In [40]:
paragraph.xpath('child::text()')

[(1) 'Is this '
 (2) '?']

In [41]:
paragraph.xpath('string(self::*)')

[(1) 'Is this a link?']

`child::text()` selects all child nodes that are text nodes. The text "a link" belongs to the `<a>` element inside the paragraph, so it's not selected.

Whereas `string(self::*)` applies to the paragraph (the context node, selected with `self::*`) and recursively gets text content of children, children of children and so on.

## Predicates

_axis_ :: _nodetest_ **[PREDICATE]\***

Predicates are the last part of each step in a location path. Predicates are optional.

They are used to further filter nodes on properties that cannot be expressed with the step's axis and node test.

Remember that XPath location paths work step by step. Each step produces a node-set for each node from the previous step's node-set, with possibly more than 1 node in each node set.

You may not be interested in all nodes from a node test.

The syntax for predicates is simple: just surround conditions within square brackets. What's inside the square brackets can be:

- a number (see positional predicates below)
- a location path: the predicate keeps the nodes for which the location path matches at least one node
- a boolean expression: for example to test a condition on text content or on a count of children

### Positional predicates

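The first use-case is selecting nodes based on their position in a node-set. Position is counted along the step's axis: in document order for forward axes such as `child`, and in reverse document order for reverse axes such as `ancestor` or `preceding-sibling`. The examples below only use the `child` axis, so position follows document order.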

Let's say we don't want the 2 paragraphs in the `<div>` we looked at earlier, only the first one:

In [42]:
doc.xpath('//body/div/div/p')

[(1) '<p>This is a paragraph.</p>'
 (2) '<p>Is this <a href="page2.html">a link</a>?</p>']

In [43]:
doc.xpath('//body/div/div/p[1]')

[(1) '<p>This is a paragraph.</p>']

If you want the last node in a node-set, you can use `last()`:

In [44]:
doc.xpath('//body/div/div[last()]')

[(1) '<div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>']

#### Position ranges

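`position()` returns the position of a node in the current node-set, so comparing it against numbers selects a range. For example, `//table/tbody/tr[position() > 2]` skips the first 2 rows of each table body.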

### Location paths as predicates

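When a predicate contains a location path, a node is kept if that path, evaluated relative to the node, selects at least one node. For example, `//table[tr/div/a]` keeps tables that have a `<tr>` child with a `<div>` child that itself has an `<a>` child.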

### Boolean predicates

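A predicate can hold any expression that evaluates to a boolean, such as a comparison on a count. For example, `//table[count(tr)=10]` keeps tables with exactly 10 `<tr>` children.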

#### Special case of string value tests

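When a node-set is compared with a string, the test is true if at least one node in the set has that string value. For example, `//table[.//img/@src="pic.png"]` keeps tables containing an image whose `src` is "pic.png", and `//table[th="Some headers"]` keeps tables with a `<th>` child whose text is exactly "Some headers".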

#### Special trick for testing multiple node names

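XPath 1.0 has no node test matching several element names at once, but the `self` axis can be combined with `or` inside a predicate. For example, `./descendant-or-self::*[self::ul or self::ol]` selects the context node and its descendants that are either `<ul>` or `<ol>` elements.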

### Nested predicates

We said that location paths can be used as predicates. And location paths can have predicates. So it's possible to end up with nested predicates.

In [45]:
#                <------predicate --------->
#                    <-nested predicate->
doc.xpath('//div[p  [a/@href="page2.html"]  ]')

[(1) '<div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>']

In fact, the above is equivalent to `//div[p/a/@href="page2.html"]` with no nesting:

In [46]:
doc.xpath('//div[p/a/@href="page2.html"]')

[(1) '<div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>']

### Order of predicates is important

You can have multiple predicates in sequence per step, each within its `[]` brackets, i.e. steps in the form of `axis::nodetest[predicate#1][predicate#2][predicate#3]...`

Predicates are processed in order, from left to right. And the output of one predicate is fed into the next predicate filter, much like steps produce node-sets for the next step to process.

So the order of predicates is important.

The following 2 location paths produce different results:


In [47]:
doc.xpath('//div[2][@class="second"]')

[(1) '<div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>']

In [48]:
doc.xpath('//div[@class="second"][2]')

[]

The 2nd produces nothing. Why is that?

`//div[2][@class="second"]` looks at `div` elements that are the 2nd `div` child of their parent (because `div` means `child::div`, and `[2]` selects the 2nd node in the current node-set).
In our document this happens only once.
The final predicate, `[@class="second"]`, keeps only nodes that have a "class" attribute with value "second". This happens to be true for that 2nd child `div`.

On the contrary, `//div[@class="second"][2]` first applies `[@class="second"]` to the `div` children of each node, which yields at most single-node node-sets (again, there's only 1 `div` with a "class" attribute with value "second").
So the subsequent `[2]` predicate can never match, since no node-set has a 2nd node.

### Abbreviation cheatsheet

| Abbreviated step             | Meaning
|------------------------------|-------------------------------------------------
| `*` (asterisk)               | all **element** nodes (i.e. not text nodes, not attribute nodes); remember that `.//*` is not the same as `.//node()`; also, there's no `element()` node test
| `@*`                         | `attribute::*` (all attribute nodes)
| `//`                         | `/descendant-or-self::node()/`: exactly this, nothing more, nothing less, so `//*` is not the same as `/descendant-or-self::*`
| `.` (a single dot)           | `self::node()`, the context node; useful for forming a relative XPath, e.g. `.//tr`
| `..` (2 dots)                | `parent::node()`

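`//*` expands to `/descendant-or-self::node()/child::*`: its last step is a `child` step, not a `descendant-or-self` step. This matters with positional predicates: `//p[1]` selects every `<p>` that is the first `<p>` child of its parent, whereas `/descendant::p[1]` selects only the first `<p>` of the whole document. In relative paths, `./descendant-or-self::*` also includes the context node itself, which `.//*` never does.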

## String functions

TODO

# Part 3: Use-cases for web scraping

TODO

## Text extraction

TODO

## Attributes extraction

TODO

### Attribute names extractions

TODO

## CSS Selectors

TODO

## Loop on elements (table rows, lists)

TODO

## Element boundaries & XPath buckets (advanced)

TODO

## EXSLT extensions

TODO

# Summary of tips

* Use relative XPath expressions whenever possible
* Know your axes!
* Don't forget that XPath has `string()` and `normalize-space()` functions
* **`text()` is a node test**, not a function call
* CSS selectors are very handy, easier to maintain, but also less powerful than XPath