# Part 1: What is XPath?

## XPath is a language

> XPath is a language for addressing parts of an XML document

-- from [XML Path Language 1.0](https://www.w3.org/TR/xpath/)

This abstract from the official specifications says it all:

* "is a language": you pass a character string (an XPath expression)...
* "for addressing parts of an XML document": ...a string that you pass to an XPath engine acting on an XML (or HTML) document, which outputs parts of that document following the data model below.

The XPath [data model](http://www.w3.org/TR/xpath/#data-model) is a tree of nodes:
* element nodes (`<p>...</p>`)
* attribute nodes (`href="page.html"`)
* text nodes (`"Some Title"`)
* comment nodes (`<!-- a comment -->`)
* (and 3 other types that we won’t cover here.)

In effect, this data model allows you to represent everything inside an XML or HTML document, in a structured and hierarchical way.
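To make the node types above a bit more concrete, here is a small sketch using Python's standard-library `xml.dom.minidom`. This is not the XPath data model itself (the DOM differs in some details), and the snippet being parsed is a made-up example, but it shows the same idea: elements, text, comments and attributes are all nodes with a type.

```python
from xml.dom import Node, minidom

# A tiny, hypothetical snippet (not the sample page used in this tutorial).
snippet = '<p class="intro">Some <b>Title</b><!-- a comment --></p>'
p = minidom.parseString(snippet).documentElement

names = {
    Node.ELEMENT_NODE: 'element',
    Node.TEXT_NODE: 'text',
    Node.COMMENT_NODE: 'comment',
    Node.ATTRIBUTE_NODE: 'attribute',
}

# Children of <p>, in document order.
children = [(names[child.nodeType], child.nodeName) for child in p.childNodes]

# Attributes are nodes too, but they are not children of their element.
attributes = [(names[attr.nodeType], attr.name) for attr in p.attributes.values()]

print(children)
print(attributes)
```

Note how the text `Some ` and the comment sit next to the `<b>` element as siblings, while the `class` attribute is reached separately.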

## Why learn XPath?

* navigate **everywhere** inside a DOM tree
* a must-have skill for accurate web data extraction
* more powerful than CSS selectors
* fine-grained look at the text content
* complex conditioning with axes
* extensible with custom functions (we won’t cover that in this talk though)

## XPath data model and first examples

Let's use this sample HTML page to illustrate how XPath works:

```
<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type">
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>
</body>
</html>
```

In [1]:
import libxml2

In [2]:
htmlsample = '''<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type" />
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br />
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
  </div>
</body>
</html>'''

In [3]:
doc = libxml2.htmlReadDoc(htmlsample, 'http://www.example.com', 'utf-8', 0)

In XPath's data model, everything is a node: elements, attributes, comments... (but not every node is an element.)

Nodes also have an order, the **document order**: the order in which they appear in the XML/HTML source.

The little function below builds an ASCII tree representation of the HTML document. You don't have to understand what it does right now, simply look at the output:

In [4]:
from collections import OrderedDict
from asciitree import LeftAligned


order = 0
def traverse(node, accept=('text', 'attribute', 'comment'), ignore_empty_text=True):
    global order
    children_list = OrderedDict()
    _name = "#%2d--<%s>" % (order, node.name)
    _type = node.get_type()
    if node.isText():
        if 'text' not in accept:
            return
        if ignore_empty_text and not node.content.strip():
            return
        _content = '#%2d--(TXT): %r' % (order, node.content)
    elif _type == 'attribute':
        if 'attribute' not in accept:
            return
        _content = '#%2d--(ATTR): %s: %r' % (order, node.name, node.content)
    elif _type == 'comment':
        if 'comment' not in accept:
            return
        _content = '#%2d--(COMM): %r' % (order, node.content)
    elif _type == 'document_html':
        _content = '#%2d--(ROOT)' % (order,)
    else:
        _content = _name
    
    for child in node.xpathEval('(child::node() | attribute::*)'):
        order += 1
        r = traverse(child, accept=accept, ignore_empty_text=ignore_empty_text)
        if r is not None:
            children_list.update(r)
            
    return {_content: children_list}


tr = LeftAligned()
print(tr(OrderedDict(traverse(doc, ignore_empty_text=False))))

# 0--(ROOT)
 +-- # 1--<html>
     +-- # 2--(TXT): '\n'
     +-- # 3--<head>
     |   +-- # 4--(TXT): '\n  '
     |   +-- # 5--<title>
     |   |   +-- # 6--(TXT): 'This is a title'
     |   +-- # 7--(TXT): '\n  '
     |   +-- # 8--<meta>
     |   |   +-- # 9--(ATTR): content: 'text/html; charset=utf-8'
     |   |   +-- #10--(ATTR): http-equiv: 'content-type'
     |   +-- #11--(TXT): '\n'
     +-- #12--(TXT): '\n'
     +-- #13--<body>
     |   +-- #14--(TXT): '\n  '
     |   +-- #15--<div>
     |   |   +-- #16--(TXT): '\n    '
     |   |   +-- #17--<div>
     |   |   |   +-- #18--(TXT): '\n      '
     |   |   |   +-- #19--<p>
     |   |   |   |   +-- #20--(TXT): 'This is a paragraph.'
     |   |   |   +-- #21--(TXT): '\n      '
     |   |   |   +-- #22--<p>
     |   |   |   |   +-- #23--(TXT): 'Is this '
     |   |   |   |   +-- #24--<a>
     |   |   |   |   |   +-- #25--(ATTR): href: 'page2.html'
     |   |   |   |   |   +-- #26--(TXT): 'a link'
     |   |   |   |   +-- #27--(TXT): '?

You can see various tree branches and leaves:

* e.g. `<div>` or `<p>`: these are element nodes
* `(TXT)` represent text nodes
* `(ATTR)` represent attribute nodes
* `(COMM)` represent comment nodes

## XPath return types

When applied over a document, an XPath expression can return either:

* a node-set (most common case, and most often element nodes)
* a string
* a number (floating point)
* a boolean

**Note: When an XPath expression returns node-sets, you do get a set of nodes, even if there's only one node in the set.**

## XPath expressions

We will now take a look at some example XPath expressions to get a feel for how they work. We'll explain the syntax in more detail later on.

XPath expressions are passed to an XPath engine as strings. Here we are using `libxml2`'s `.xpathEval()` method on a parsed HTML document object.

### Selecting the root node of a document

The root node is a special node:

> The root node is the root of the tree. A root node does not occur except as the root of the tree. The element node for the document element is a child of the root node.

Selecting the root node of a document with XPath is one of the shortest XPath expressions: `/` (a forward slash).

This is very similar to `cd /` (going to the root directory) in a shell within a Unix filesystem.

In [5]:
# XPath expression
#              |
#              v
doc.xpathEval('/')

[<xmlDoc (http://www.example.com) object at 0x7f0b9be09170>]

### Selecting elements (a.k.a "tags")


Children of the root node: the HTML element (a node-set)

In [6]:
# XPath expression
#               |
#             |<>|
doc.xpathEval('/*')

[<xmlNode (html) object at 0x7f0b941a05f0>]

The asterisk here, `*`, means "any element". And `/*` means "any element under the root node". HTML documents (usually) have only 1 element like this: the `<html>` tag.

Get the `<title>` element, using its explicit path from the root:

In [7]:
#               XPath expression
#                       |
#             |<-------------->|
doc.xpathEval('/html/head/title')

[<xmlNode (title) object at 0x7f0b941b1ab8>]

Again, if you are familiar with the Unix filesystem, you probably intuitively understand what this does:
* start from the root (of the document)
* select the `<html>` node
* select the `<head>` node under the `<html>` node
* select the `<title>` node under the `<head>` node

In other words, the XPath expression represents the path from the root node to the target node(s), much like a Unix filepath represents the path from the filesystem root to the target file(s).

Selecting text nodes is a bit different: you use the special `text()` syntax.

Get the text nodes of `<title>` elements (remember that `<title>` is an element, and that it happens to contain a text node, the text value "This is a title"):

In [8]:
#                   XPath expression
#                          |
#             |<--------------------->|
doc.xpathEval('/html/head/title/text()')

[<xmlNode (text) object at 0x7f0b941c1d40>]

Selecting all paragraph elements inside the HTML body:

In [9]:
#           XPath expression
#                  |
#             |<------->|
doc.xpathEval('//body//p')

[<xmlNode (p) object at 0x7f0b941c1a28>,
 <xmlNode (p) object at 0x7f0b941c1ef0>]

Here, we're introducing the double-slash (`//`) syntax to do multi-hop lookups.

If you don't know (or don't care about) the level where the element that you need is located (relative to the root node), `//` tells the XPath engine to search recursively further down the tree, rather than exactly one level deeper as in the examples above.
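If you don't have the `libxml2` bindings at hand, the same two ideas (step-by-step paths and `//`) can be tried with Python's standard-library `xml.etree.ElementTree`. This is only a sketch under a few assumptions: ElementTree supports a small subset of XPath, paths are always relative to an element (here the `<html>` element, not the root node), and the sample is written as well-formed XML (self-closed `<meta />` and `<br />`).

```python
import xml.etree.ElementTree as ET

SAMPLE = '''<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type" />
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br />
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>.
      <!-- And this comment -->
    </div>
  </div>
</body>
</html>'''

# fromstring() returns the <html> element, so paths start one level
# below what '/html/...' would describe.
html = ET.fromstring(SAMPLE)

# One hop at a time: <head> under <html>, then <title> under <head>.
title = html.find('head/title')

# Multi-hop lookup: any <p> somewhere below any <body> below <html>.
paragraphs = html.findall('.//body//p')

print(title.tag, repr(title.text))
print(len(paragraphs))
```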


### Selecting attributes

Elements can also have attributes. In the sample HTML document we are using, we have 2 `<a>` elements, each with a `href` attribute. There's also a `<meta>` element with 2 attributes: `content` and `http-equiv`.

This is how you can select these attributes with an `@` prefix before the attribute name:

In [10]:
doc.xpathEval('//a/@href')

[<xmlAttr (href) object at 0x7f0b941ae0e0>,
 <xmlAttr (href) object at 0x7f0b941ae320>]

In [11]:
doc.xpathEval('//meta/@*')

[<xmlAttr (content) object at 0x7f0b941ae2d8>,
 <xmlAttr (http-equiv) object at 0x7f0b941ae5a8>]

The asterisk here after `@` means the same thing as in `/*`, except that this is for attributes and not elements: you want all attributes, whatever their name.
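For comparison, here is a rough equivalent with the standard-library `xml.etree.ElementTree`. Its XPath subset cannot return attribute nodes, so this sketch selects the elements first and then reads their attributes in Python. The trimmed-down document below is an assumption made to keep the example short.

```python
import xml.etree.ElementTree as ET

# A shortened, well-formed version of the sample page (assumption: only
# the elements that carry attributes are kept).
SAMPLE = '''<html>
<head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type" />
</head>
<body>
  <p>Is this <a href="page2.html">a link</a>?</p>
  <div class="second">Except maybe this <a href="page3.html">other link</a>.</div>
</body>
</html>'''

html = ET.fromstring(SAMPLE)

# Roughly //a/@href: select the <a> elements, then read one attribute.
hrefs = [a.get('href') for a in html.findall('.//a')]

# Roughly //meta/@*: every attribute name of the <meta> element.
meta_attribute_names = sorted(html.find('.//meta').attrib)

print(hrefs)
print(meta_attribute_names)
```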

### Get a string representation of an element

In [12]:
#                    XPath expression
#                           |
#             |<---------------------->|
doc.xpathEval('string(/html/head/title)')

'This is a title'

This example uses one of several handy string functions in XPath. `string()` will concatenate all text content from the selected node and all of its children, recursively, effectively stripping HTML tags.

### Counting elements

Number of paragraphs in the document:

In [13]:
#           XPath expression
#                   |
#             |<-------->|
doc.xpathEval('count(//p)')

2.0

Note that you get a floating point number back.

Number of attributes in the document (whatever their parent element):

In [14]:
#           XPath expression
#                   |
#             |<-------->|
doc.xpathEval('count(//@*)')

5.0

### Boolean operations

For example, testing the number of paragraphs:

In [15]:
doc.xpathEval('count(//p) = 2')

1

In [16]:
doc.xpathEval('count(//p) = 42')

0

# Part 2: Location Paths: how to move inside the document tree

A **Location path** is the most common XPath expression.

It is used to move in any direction from a starting point (*the context node*) to any node(s) in the tree:

* It is a string, with a series of “steps”: `"step1 / step2 / step3 ..."`

* It represents the selection and filtering of nodes, processed step by step, from left to right.

* Each step is of the form: `AXIS :: NODETEST [PREDICATE]*`

Note: whitespace between tokens does NOT matter, except inside `//` (`/ /` is a syntax error).

So **don’t be afraid of indenting your XPath expressions**. The following 3 expressions produce the same result:


In [17]:
doc.xpathEval('/html/head/title')

[<xmlNode (title) object at 0x7f0b941ae9e0>]

In [18]:
doc.xpathEval('/    html   / head   /title')

[<xmlNode (title) object at 0x7f0b941aea70>]

In [19]:
doc.xpathEval('''
    /html
        /head
            /title''')

[<xmlNode (title) object at 0x7f0b941aed40>]

## Relative vs. absolute paths

Location paths can be relative or absolute:

* `"step1/step2/step3"` is relative
* `"/step1/step2/step3"` is absolute

i.e. an absolute path is a relative path starting with "/" (slash)

In other terms, absolute paths are relative to the root node.

**Tip**: use relative paths whenever possible. This prevents unexpectedly selecting the same nodes in every loop iteration.

For example, in our sample document, only one `<div>` contains paragraphs. Looping over each `<div>` and using the absolute location path `//p` will produce the same result for each iteration: returning ALL paragraphs in the document every time.


In [20]:
for div in doc.xpathEval('//body//div'):
    print(div.xpathEval('//p'))

[<xmlNode (p) object at 0x7f0b941ab128>, <xmlNode (p) object at 0x7f0b941ab170>]
[<xmlNode (p) object at 0x7f0b941ab128>, <xmlNode (p) object at 0x7f0b941ab1b8>]
[<xmlNode (p) object at 0x7f0b941ab128>, <xmlNode (p) object at 0x7f0b941ab290>]


Compare this with using the relative `p` or `./p`, which only looks at `<p>` children of each `<div>`: only 1 of those `<div>` elements turns out to have paragraphs:

In [21]:
for div in doc.xpathEval('//body//div'):
    print(div.xpathEval('p'))

[]
[<xmlNode (p) object at 0x7f0b941ab290>, <xmlNode (p) object at 0x7f0b941ab200>]
[]


In [22]:
for div in doc.xpathEval('//body//div'):
    print(div.xpathEval('./p'))

[]
[<xmlNode (p) object at 0x7f0b941ab200>, <xmlNode (p) object at 0x7f0b941ab2d8>]
[]
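The same loop can be sketched with the standard-library `xml.etree.ElementTree`, assuming the well-formed variant of the sample page. ElementTree only accepts relative paths on elements, so the contrast here is between `p` (children only) and `.//p` (anywhere below the current `<div>`), counted per `<div>`.

```python
import xml.etree.ElementTree as ET

SAMPLE = '''<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type" />
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br />
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>.
      <!-- And this comment -->
    </div>
  </div>
</body>
</html>'''

html = ET.fromstring(SAMPLE)
divs = html.findall('.//body//div')  # the 3 <div> elements, in document order

# 'p' is relative to each <div> and only looks at its direct children.
child_counts = [len(div.findall('p')) for div in divs]

# './/p' is still relative, but looks at every level below each <div>.
descendant_counts = [len(div.findall('.//p')) for div in divs]

print(child_counts)
print(descendant_counts)
```

The outer `<div>` has no `<p>` children of its own, but it does have 2 `<p>` descendants, which is why the two lists differ for the first entry.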


## Abbreviated syntax

What we’ve seen earlier is in fact “abbreviated syntax”.

Full syntax is quite verbose:

| Abbreviated syntax           | Full syntax
|------------------------------|-------------------------------------------------
| `/html/head/title`           | `/child::html /child:: head /child:: title`
| `//meta/@content`            | `/descendant-or-self::node() /child::meta /attribute::content`
| `//div/div[@class="second"]` | `/descendant-or-self::node() /child::div /child::div [attribute::class = "second"]`
| `//div/a/text()`             | `/descendant-or-self::node() /child::div /child::a /child::text()`

## Axes: moving around

The "axis" is the first part of each location path step. It can be explicit or implicit in abbreviated syntax.

In this section, we'll use explicit axes as much as we can.

**AXIS** :: _nodetest [predicate]*_

**Axes give the direction to go next.**

* `self` (where you are)
* `parent`, `child` (direct hop)
* `ancestor`, `ancestor-or-self`, `descendant`, `descendant-or-self` (multi-hop)
* `following`, `following-sibling`, `preceding`, `preceding-sibling` (document order)
* `attribute`, `namespace` (non-element)

### Move up or down the tree

Let's assume that we have selected the first `<div>` element in our sample document, the one just under the `<body>` element:

In [23]:
first_div = doc.xpathEval('//body/div')[0]

The `self` axis represents *the context node*, i.e. where you currently are in the location path step. (This may not sound very useful, but we will see later when this can be handy.)

In [24]:
first_div.xpathEval('self::*')

[<xmlNode (div) object at 0x7f0b941ab0e0>]

The `child` axis is for immediate children nodes of the context node. Here, our context `<div>` node has 2 `<div>` children:

In [25]:
first_div.xpathEval('child::*')

[<xmlNode (div) object at 0x7f0b941ab2d8>,
 <xmlNode (div) object at 0x7f0b941ab4d0>]

The `parent` axis is the dual of `child`: you go up one level in the DOM:

In [26]:
first_div.xpathEval('parent::*')

[<xmlNode (body) object at 0x7f0b941ab518>]

Let's simplify our ASCII tree representation from earlier to only consider element nodes:

In [27]:
order = 0
tr = LeftAligned()
print(tr(OrderedDict(traverse(doc, accept=('element',), ignore_empty_text=False))))

# 0--(ROOT)
 +-- # 1--<html>
     +-- # 3--<head>
     |   +-- # 5--<title>
     |   +-- # 8--<meta>
     +-- #13--<body>
         +-- #15--<div>
             +-- #17--<div>
             |   +-- #19--<p>
             |   +-- #22--<p>
             |   |   +-- #24--<a>
             |   +-- #29--<br>
             +-- #32--<div>
                 +-- #35--<a>


With this simplified tree representation, this is what `self`, `child` and `parent` select:

```
                # 0--(ROOT)
                 +-- # 1--<html>
                     +-- # 3--<head>
                     |   +-- # 5--<title>
                     |   +-- # 8--<meta>
parent::* ---------> +-- #13--<body>
                         |
self::* ------------->   +-- #15--<div>
                             |
child::*----+----------->    +-- #17--<div>
            |                |   +-- #19--<p>
            |                |   +-- #22--<p>
            |                |   |   +-- #24--<a>
            |                |   +-- #29--<br>
            +----------->    +-- #32--<div>
                                 +-- #35--<a>
```

#### Recursively go up or down

The `descendant` axis is similar to `child` but also goes deeper in the tree, looking at children of each child, recursively:

In [28]:
first_div.xpathEval('descendant::*')

[<xmlNode (div) object at 0x7f0b941abea8>,
 <xmlNode (p) object at 0x7f0b941abb48>,
 <xmlNode (p) object at 0x7f0b941abfc8>,
 <xmlNode (a) object at 0x7f0b941abf80>,
 <xmlNode (br) object at 0x7f0b941c82d8>,
 <xmlNode (div) object at 0x7f0b941c8098>,
 <xmlNode (a) object at 0x7f0b941c8128>]

You might guess already what `ancestor` is for: it is the dual axis of `descendant`:

In [29]:
first_div.xpathEval('ancestor::*')

[<xmlNode (html) object at 0x7f0b941c83f8>,
 <xmlNode (body) object at 0x7f0b941c88c0>]
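To see what these four axes compute, here is a sketch that emulates them in plain Python on top of the standard-library `xml.etree.ElementTree` (which has no axis support of its own). The `parent_of` map and the `ancestors()` helper are names made up for this illustration, and only element nodes are considered, as with `::*`.

```python
import xml.etree.ElementTree as ET

SAMPLE = '''<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type" />
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br />
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>.
    </div>
  </div>
</body>
</html>'''

html = ET.fromstring(SAMPLE)

# ElementTree elements do not know their parent: build a child -> parent map.
parent_of = {child: parent for parent in html.iter() for child in parent}

def ancestors(element):
    """Emulate ancestor::* and return elements outermost first."""
    found = []
    while element in parent_of:
        element = parent_of[element]
        found.append(element)
    return found[::-1]

first_div = html.find('body/div')

children = list(first_div)                # child::*
parent = parent_of[first_div]             # parent::*
descendants = list(first_div.iter())[1:]  # descendant::* (iter() yields the element itself first)

print([e.tag for e in children])
print(parent.tag)
print([e.tag for e in descendants])
print([e.tag for e in ancestors(first_div)])
```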

#### Special case of `descendant-or-self` axis

TODO

### Move "sideways": children nodes of the same parent
 
If nodes can have parents, children, ancestors and descendants, they can also have siblings (to continue the family metaphor). **Siblings are nodes that have the same parent node.**
 
Some siblings may come before the context node (they appear earlier in the document, so their order is lower), and others may come after it. There are 2 axes for these 2 directions: `preceding-sibling` and `following-sibling`.

Let's first select this paragraph from our sample document: `<p>Is this <a href="page2.html">a link</a>?</p>`. It's the 2nd `<p>` child of the 1st `<div>` inside the `<div>` we used above:

In [30]:
paragraph = first_div.xpathEval('child::div[1]/child::p[2]')[0]

You can notice above that we started using 2 new patterns along with the axes:

- `child::div` vs. `child::*`: `*` means "any element node" (this is a _NODETEST_ that we'll cover afterwards)
- `[1]` and `[2]`: which mean _first_ and _second_ in the current step's node-set (this is a kind of _PREDICATE_ that we'll cover afterwards also)

In [31]:
paragraph.xpathEval('preceding-sibling::*')

[<xmlNode (p) object at 0x7f0b941c8830>]

In [32]:
paragraph.xpathEval('following-sibling::*')

[<xmlNode (br) object at 0x7f0b941c8d40>]

Again, let's see which elements were selected in our ASCII tree representation:

```
                # 0--(ROOT)
                 +-- # 1--<html>
                     +-- # 3--<head>
                     |   +-- # 5--<title>
                     |   +-- # 8--<meta>
                     +-- #13--<body>
                         |
                         +-- #15--<div>
                             |
                             +-- #17--<div>
                             |   |
                             |   |
preceding-sibling::* ----------> +-- #19--<p>
                             |   |
                             |   |
self::* -----------------------> +-- #22--<p>
                             |   |   |
                             |   |   +-- #24--<a>
                             |   |
                             |   |
following-sibling::* ----------> +-- #29--<br>
                             |
                             |
                             +-- #32--<div>
                                 +-- #35--<a>
```

In [33]:
paragraph.xpathEval('following-sibling::node()')

[<xmlNode (text) object at 0x7f0b94150050>,
 <xmlNode (br) object at 0x7f0b941500e0>,
 <xmlNode (text) object at 0x7f0b94150098>]
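Sibling axes boil down to "same parent, before or after me". The sketch below emulates `preceding-sibling::*` and `following-sibling::*` with the standard-library `xml.etree.ElementTree`, using a shortened version of the sample page and a hypothetical `parent_of` map. ElementTree does not expose text nodes as separate objects, so only the element flavour (`::*`, not `::node()`) is shown.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed excerpt of the sample page (assumption).
SAMPLE = '''<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br />
      Apparently.
    </div>
    <div class="second">Nothing to add.</div>
  </div>
</body>'''

body = ET.fromstring(SAMPLE)
parent_of = {child: parent for parent in body.iter() for child in parent}

paragraph = body.findall('.//p')[1]  # the 2nd <p>, our context node

# All element children of the context node's parent, in document order.
siblings = list(parent_of[paragraph])
position = siblings.index(paragraph)

preceding_siblings = siblings[:position]      # preceding-sibling::*
following_siblings = siblings[position + 1:]  # following-sibling::*

print([e.tag for e in preceding_siblings])
print([e.tag for e in following_siblings])
```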

#### Nodes before and after, in document order

`preceding` and `following` are 2 special axes that do not look at the tree hierarchy, but work on the document order of nodes.

Remember, all nodes in XPath data model have an order, called the _document order_. Node 1 is the first node in the HTML source, node 2 is the node appearing next etc.

```
  #1    #2    #3   ...
<html><head><title>...
```

In [34]:
paragraph.xpathEval('preceding::*')

[<xmlNode (head) object at 0x7f0b94150200>,
 <xmlNode (title) object at 0x7f0b94150488>,
 <xmlNode (meta) object at 0x7f0b941501b8>,
 <xmlNode (p) object at 0x7f0b941504d0>]

In [35]:
paragraph.xpathEval('following::*')

[<xmlNode (br) object at 0x7f0b941c8e60>,
 <xmlNode (div) object at 0x7f0b941c87a0>,
 <xmlNode (a) object at 0x7f0b941c8fc8>]

Note that `preceding` does not include ancestors and `following` does not include descendants.

This property is mentioned in XPath specs like this:

> The ancestor, descendant, following, preceding and self axes partition a document (ignoring attribute and namespace nodes): they do not overlap and together they contain all the nodes in the document.

i.e. `document == self U (ancestor U preceding) U (descendant U following)`
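The partition property can be checked by hand. The following sketch rebuilds the five axes for our paragraph in plain Python on top of the standard-library `xml.etree.ElementTree`, assuming the well-formed sample and looking at element nodes only, and then verifies that they cover every element exactly once.

```python
import xml.etree.ElementTree as ET

SAMPLE = '''<html>
<head>
  <title>This is a title</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type" />
</head>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br />
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>.
    </div>
  </div>
</body>
</html>'''

html = ET.fromstring(SAMPLE)
elements = list(html.iter())  # every element, in document order
parent_of = {child: parent for parent in elements for child in parent}

paragraph = html.findall('.//p')[1]
index = elements.index(paragraph)

# ancestor::* : walk up the parent map until we run out of parents.
ancestor = []
node = paragraph
while node in parent_of:
    node = parent_of[node]
    ancestor.append(node)

descendant = list(paragraph.iter())[1:]  # descendant::*

# preceding::* : earlier in document order, minus the ancestors.
preceding = [e for e in elements[:index] if e not in ancestor]
# following::* : later in document order, minus the descendants.
following = [e for e in elements[index + 1:] if e not in descendant]

print([e.tag for e in preceding])
print([e.tag for e in following])

# The five axes together contain each element exactly once.
total = 1 + len(ancestor) + len(preceding) + len(descendant) + len(following)
print(total == len(elements))
```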

## Node tests

A "node test" is the second part of each step in a location path.

_axis_ :: **NODETEST** _[predicate]*_

Node tests select node types along the step's axis.

They can be:

* a *name test*:

  * such as "p", "title" or "a" for elements: `/html/head/title` contains 3 steps, each using a name test
  * or "href" or "src" for attributes: `//a/@href` selects the "href" attributes of `<a>` elements
  * "*" (an asterisk): a wildcard name test whose meaning depends on the axis:
    * an "*" step alone selects any element node (a.k.a. tag)
    * an "@*" selects any attribute node

* a *node type test*:

  * "node()": any node type
  * "text()": text nodes
  * "comment()": comment nodes

**Note:** `text()` is not a function call that converts a node to its text representation, it's just a test on the node type.

Compare these 2 expressions:

In [36]:
for t in paragraph.xpathEval('child::text()'):
    print("%r" % t.get_content())

'Is this '
'?'


In [37]:
paragraph.xpathEval('string(self::*)')

'Is this a link?'

`child::text()` selects all child nodes that are text nodes. "a link" is part of the `<a>` inside the paragraph, so it's not selected.

Whereas `string(self::*)` applies to the paragraph (the context node, selected with `self::*`) and recursively gets text content of children, children of children and so on.
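The difference can also be seen with the standard-library `xml.etree.ElementTree`, even though it has neither `text()` nor `string()`. In this sketch, which is an emulation and not how an XPath engine works internally, the direct text children are rebuilt from ElementTree's `.text` and `.tail` fields, and the string value from `itertext()`.

```python
import xml.etree.ElementTree as ET

# Only the paragraph from the sample page is needed here.
paragraph = ET.fromstring('<p>Is this <a href="page2.html">a link</a>?</p>')

# Roughly child::text(): text sitting directly inside <p>.
# ElementTree keeps the text before the first child in .text, and the
# text following each child element in that child's .tail.
pieces = [paragraph.text] + [child.tail for child in paragraph]
child_texts = [piece for piece in pieces if piece]

# Roughly string(self::*): all text below <p>, at any depth, concatenated.
string_value = ''.join(paragraph.itertext())

print(child_texts)
print(repr(string_value))
```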

## Predicates

_axis_ :: _nodetest_ **[PREDICATE]\***

Predicates are the last part of each step in a location path. Predicates are optional.

They are used to further filter nodes on properties that cannot be expressed with the step's axis and node test.

Remember that XPath location paths work step by step. Each step produces a node-set for each node from the previous step's node-set, with possibly more than 1 node in each node set.

You may not be interested in all nodes from a node test.

The syntax for predicates is simple: just surround conditions within square brackets. What's inside the square brackets can be:

- a number (see positional predicates below)
- a location path: the predicate will select nodes for which the location path matches at least one node
- a boolean operation: for example to test a condition on text content or count of children

### Positional predicates

The first use-case is selecting nodes based on their position in a node-set. The order within a node-set depends on the axis, but for now let's assume that nodes in a node-set are in document order.

Let's say we don't want the 2 paragraphs in the `<div>` we looked at earlier, only the first one:

In [38]:
for paragraph in doc.xpathEval('//body/div/div/p'):
    print(paragraph.serialize())

<p>This is a paragraph.</p>
<p>Is this <a href="page2.html">a link</a>?</p>


In [39]:
for paragraph in doc.xpathEval('//body/div/div/p[1]'):
    print(paragraph.serialize())

<p>This is a paragraph.</p>


If you want the last node in a node-set, you can use `last()`:

In [40]:
for div in doc.xpathEval('//body/div/div[last()]'):
    print(div.serialize())

<div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>
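Simple positional predicates are also part of the small XPath subset in Python's standard-library `xml.etree.ElementTree`, so both examples can be sketched without `libxml2`. This assumes the well-formed variant of the sample page; note that ElementTree counts positions among same-named children of the same parent, which happens to match XPath for these two expressions.

```python
import xml.etree.ElementTree as ET

SAMPLE = '''<html>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br />
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>.
    </div>
  </div>
</body>
</html>'''

html = ET.fromstring(SAMPLE)

# p[1]: keep only the first <p> among the <p> children of each <div>.
first_paragraphs = html.findall('.//body/div/div/p[1]')

# div[last()]: keep only the last <div> child of the outer <div>.
last_divs = html.findall('.//body/div/div[last()]')

print([p.text for p in first_paragraphs])
print([div.get('class') for div in last_divs])
```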


#### Position ranges

TODO: things like `//table/tbody/tr[position() > 2]`

### Location paths as predicates

TODO: things like `//table[tr/div/a]`

### Boolean predicates

TODO: things like `//table[count(tr)=10]`

#### Special case of string value tests

TODO: things like `//table[.//img/@src="pic.png"]` or `//table[th="Some headers"]`

#### Special trick for testing multiple node names

TODO: things like `./descendant-or-self::*[self::ul or self::ol]`

### Nested predicates

We said that location paths can be used as predicates. And location paths can have predicates. So it's possible to end up with nested predicates.

In [41]:
#                               <------predicate --------->
#                                  <-nested predicate->
for div in doc.xpathEval('//div[p  [a/@href="page2.html"]  ]'):
    print(div.serialize())

<div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>


In fact, the above is equivalent to `//div[p/a/@href="page2.html"]` with no nesting:

In [42]:
for div in doc.xpathEval('//div[p/a/@href="page2.html"]'):
    print(div.serialize())

<div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br>
      Apparently.
    </div>
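One way to read a nested predicate is as nested "there exists" tests. The sketch below spells that out in plain Python over the standard-library `xml.etree.ElementTree` (whose XPath subset cannot express this predicate directly). The helper name `has_link_to` is made up for the illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = '''<html>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
      <br />
      Apparently.
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>.
    </div>
  </div>
</body>
</html>'''

html = ET.fromstring(SAMPLE)

def has_link_to(paragraph, href):
    """Inner predicate, roughly [a/@href=href]: some <a> child has that href."""
    return any(a.get('href') == href for a in paragraph.findall('a'))

# Outer predicate, roughly //div[p[...]]: some <p> child passes the inner test.
matching_divs = [
    div for div in html.iter('div')
    if any(has_link_to(p, 'page2.html') for p in div.findall('p'))
]

print(len(matching_divs))
print([p.text for p in matching_divs[0].findall('p')])
```

Only the first inner `<div>` qualifies: the outer `<div>` has no `<p>` children, and the `<a>` in the second inner `<div>` is not inside a `<p>`.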


### Order of predicates is important

You can have multiple predicates in sequence per step, each within its `[]` brackets, i.e. steps in the form of `axis::nodetest[predicate#1][predicate#2][predicate#3]...`

Predicates are processed in order, from left to right. And the output of one predicate is fed into the next predicate filter, much like steps produce node-sets for the next step to process.

So the order of predicates is important.

The following 2 location paths produce different results:


In [43]:
for div in doc.xpathEval('//div[2][@class="second"]'):
    print(div.serialize())

<div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>. 
      <!-- And this comment -->
    </div>


In [44]:
for div in doc.xpathEval('//div[@class="second"][2]'):
    print(div.serialize())

The 2nd produces nothing. Why is that?

`//div[2][@class="second"]` looks at `div` elements that are the 2nd `div` child of their parent (because `div` means `child::div`, and `[2]` selects the 2nd node in the current node-set).
In our document this happens only once.
The final predicate, `[@class="second"]`, keeps only nodes that have a "class" attribute with value "second". This happens to be true for that 2nd child `div`.

In contrast, `//div[@class="second"][2]` first evaluates `[@class="second"]`, which only produces single-node node-sets (again, there's only 1 `div` with a "class" attribute with value "second").
So the subsequent `[2]` predicate can never match in a single-node node-set.
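The left-to-right filtering can be replayed by hand. This sketch deliberately does not rely on the predicate support in `xml.etree.ElementTree`, which does not follow XPath semantics for chained positional predicates. It only uses ElementTree to parse the well-formed sample, then applies the two predicates as plain list operations, once in each order, for every context node.

```python
import xml.etree.ElementTree as ET

SAMPLE = '''<html>
<body>
  <div>
    <div>
      <p>This is a paragraph.</p>
      <p>Is this <a href="page2.html">a link</a>?</p>
    </div>
    <div class="second">
      Nothing to add.
      Except maybe this <a href="page3.html">other link</a>.
    </div>
  </div>
</body>
</html>'''

html = ET.fromstring(SAMPLE)

def is_second_class(div):
    """The [@class="second"] predicate."""
    return div.get('class') == 'second'

position_then_class = []  # emulates //div[2][@class="second"]
class_then_position = []  # emulates //div[@class="second"][2]

for context in html.iter():
    divs = context.findall('div')  # child::div node-set for this context node

    # [2] first, then the class test on whatever is left.
    second = divs[1:2]
    position_then_class += [d for d in second if is_second_class(d)]

    # Class test first, then [2] on the already filtered node-set.
    with_class = [d for d in divs if is_second_class(d)]
    class_then_position += with_class[1:2]

print([d.get('class') for d in position_then_class])
print(class_then_position)
```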

### Abbreviation cheatsheet

| Abbreviated step             | Meaning
|------------------------------|-------------------------------------------------
| `*` (asterisk)               | all **element** nodes (i.e. not text nodes, not attribute nodes);
|                              | remember that `.//*` is not the same as `.//node()`;
|                              | also, there's no `element()` node test
| `@*`                         | `attribute::*` (all attribute nodes)
| `//`                         | `/descendant-or-self::node()/`:
|                              | exactly this, nothing more, nothing less,
|                              | so `//*` is not the same as `/descendant-or-self::*`
| `.` (a single dot)           | `self::node()`, the context node; useful for forming a relative XPath,
|                              | e.g. `.//tr`
| `..` (2 dots)                | `parent::node()`

TODO: explain why `//*` is not the same as `/descendant-or-self::*`

## String functions

TODO

# Part 3: Use-cases for web scraping

TODO

## Text extraction

TODO

## Attributes extraction

TODO

### Attribute names extractions

TODO

## CSS Selectors

TODO

## Loop on elements (table rows, lists)

TODO

## Element boundaries & XPath buckets (advanced)

TODO

## EXSLT extensions

TODO

# Summary of tips

* Use relative XPath expressions whenever possible
* Know your axes!
* Don't forget that XPath has `string()` and `normalize-space()` functions
* **`text()` is a node test**, not a function call
* CSS selectors are very handy, easier to maintain, but also less powerful than XPath