## XPath

XPath uses "path-like" syntax to identify and navigate nodes in an XML document. These path expressions look very much like the path expressions you use with traditional computer file systems.

HTML pages are treated as **trees of nodes**. The topmost element of the tree is called the `root` element.

In [1]:
# In Python, XPath is supported by the third-party "lxml" library.
from lxml import html

In [2]:
html_page = """
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is an amazing heading</h1>
        <p>This is a fantastic paragraph.</p>
        <a href="https://www.georgetown.edu/">This is an awesome link</a>
        <ul id="unordered_list">
            <li>Jason Schloetzer</li>
            <li>Bill Smith</li>
            <li>Barney Stinson</li>
        </ul>
    </body>
</html>
"""

Any HTML document is equivalent to a tree:
![html_tree](img/html_tree.png)

In [3]:
tree = html.fromstring(html_page)

In [4]:
tree

<Element html at 0x106a75db8>

- Each element except the root has exactly one parent.
- Elements may have zero or more children.
- **`Siblings`** = nodes with the same parent
- **`Ancestors`** = the node's parent, parent's parent, etc.
- **`Descendants`** = the node's children, children's children, etc.
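
These relationships can also be explored from Python. The sketch below assumes a compact stand-in for the page above (the `doc` name and the whitespace-free markup are illustrative only) and uses lxml's element traversal methods.

```python
from lxml import html

# Compact stand-in for the example page (assumption: whitespace omitted).
doc = html.fromstring(
    "<html><head><title>Page Title</title></head>"
    "<body><ul id='unordered_list'>"
    "<li>Jason Schloetzer</li><li>Bill Smith</li><li>Barney Stinson</li>"
    "</ul></body></html>"
)

first_li = doc.xpath("//li")[0]
ul = doc.xpath("//ul")[0]

# Parent: the element directly above.
parent_tag = first_li.getparent().tag

# Ancestors: parent, parent's parent, and so on (nearest first).
ancestor_tags = [el.tag for el in first_li.iterancestors()]

# Siblings: itersiblings() yields the *following* siblings by default.
sibling_texts = [el.text for el in first_li.itersiblings()]

# Descendants: children, children's children, and so on.
descendant_tags = [el.tag for el in ul.iterdescendants()]

print(parent_tag)
print(ancestor_tags)
print(sibling_texts)
print(descendant_tags)
```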

In XPath, there are 7 kinds of nodes: 
- element
- attribute
- text
- namespace
- processing-instruction
- comment
- document

We'll focus only on the first three: element, attribute, and text nodes.
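
As a rough illustration of those three kinds, the sketch below selects one of each from a one-link document. The `snippet` variable is an assumption made for the example; the XPath expressions themselves (`//a`, `//a/@href`, `//a/text()`) are standard.

```python
from lxml import html

# Minimal one-link document (assumption: trimmed from the example page).
snippet = html.fromstring(
    '<html><body>'
    '<a href="https://www.georgetown.edu/">This is an awesome link</a>'
    '</body></html>'
)

# Element nodes come back as lxml Element objects.
element_nodes = snippet.xpath("//a")

# Attribute nodes come back as strings holding the attribute value.
attribute_values = snippet.xpath("//a/@href")

# Text nodes come back as strings as well.
text_nodes = snippet.xpath("//a/text()")

print(element_nodes[0].tag)
print(attribute_values)
print(text_nodes)
```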

In [5]:
tree.getchildren()

[<Element head at 0x106a75ea8>, <Element body at 0x106a89318>]

In [6]:
tree.getchildren()[1].getparent()

<Element html at 0x106a75db8>

In [7]:
tree.getchildren()[1].getchildren()

[<Element h1 at 0x106a895e8>,
 <Element p at 0x106a89868>,
 <Element a at 0x106a898b8>,
 <Element ul at 0x106a89908>]

In [8]:
tree.getchildren()[1].getchildren()[2]

<Element a at 0x106a898b8>

In [9]:
tree.getchildren()[1].getchildren()[2].text

'This is an awesome link'
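
The lxml documentation marks `getchildren()` as deprecated; elements behave like lists of their children, so the same navigation can be written with iteration and indexing. A minimal sketch, assuming a shortened copy of the page (the `doc`, `body`, and `link` names are illustrative):

```python
from lxml import html

# Shortened copy of the example page (assumption: list omitted for brevity).
doc = html.fromstring(
    "<html><head><title>Page Title</title></head>"
    "<body><h1>This is an amazing heading</h1>"
    "<p>This is a fantastic paragraph.</p>"
    "<a href='https://www.georgetown.edu/'>This is an awesome link</a>"
    "</body></html>"
)

# Iterating over an element yields its children.
child_tags = [child.tag for child in doc]

# Indexing an element returns the child at that position (0-based).
body = doc[1]
link = body[2]

print(child_tags)
print(link.text)
print(len(body))  # number of children of <body>
```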

In [10]:
tree.xpath('//a')

[<Element a at 0x106a898b8>]

In [14]:
tree.xpath('//a')[0].text

'This is an awesome link'

In [56]:
tree.xpath('//ul')[0].text

'\n            '

In [57]:
tree.xpath('//ul')[0].itertext()

<lxml.etree.ElementTextIterator at 0x106b218d0>

In [58]:
list(tree.xpath('//ul')[0].itertext())

['\n            ',
 'Jason Schloetzer',
 '\n            ',
 'Bill Smith',
 '\n            ',
 'Barney Stinson',
 '\n        ']

In [63]:
', '.join(text.strip() for text in tree.xpath('//ul')[0].itertext()
          if text.strip())

'Jason Schloetzer, Bill Smith, Barney Stinson'
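
The whitespace-only strings appear because they are text nodes of the `ul` itself, not of its `li` children. One possible alternative, sketched below on an assumed copy of the list, is to ask XPath for the `li` text nodes directly, or to let the standard `normalize-space()` function collapse the whitespace.

```python
from lxml import html

# Copy of the list with its original indentation (assumption: rest of page omitted).
doc = html.fromstring("""<html><body>
<ul id="unordered_list">
    <li>Jason Schloetzer</li>
    <li>Bill Smith</li>
    <li>Barney Stinson</li>
</ul>
</body></html>""")

# text() on the <li> steps skips the whitespace that belongs to <ul>.
names = doc.xpath("//ul/li/text()")
joined = ", ".join(names)

# normalize-space() trims and collapses whitespace in the string value of <ul>.
collapsed = doc.xpath("normalize-space(//ul)")

print(joined)
print(collapsed)
```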

Expression|Description
:---: | ---
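`tagname`|Selects all child elements of the current node named "tagname"
`/`|Selects from the root node when it starts a path; between steps it separates a parent from its child
`//`|Selects matching nodes at any depth: anywhere in the document when the path starts with `//`, or anywhere below the current node when written as `.//`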
`.`|Selects the current node
`..`|Selects the parent of the current node
`@`|Selects attributes

- A path that starts with `/` is an absolute path: it is evaluated from the document root.
- A path that starts with `.` (or directly with a tag name) is a relative path: it is evaluated from the current node.
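
To make the distinction concrete, the sketch below evaluates absolute and relative paths from two different starting elements. The shortened document and the variable names are assumptions made for the example.

```python
from lxml import html

# Shortened copy of the example page (assumption: only the link is kept).
doc = html.fromstring(
    "<html><head><title>Page Title</title></head>"
    "<body><a href='https://www.georgetown.edu/'>This is an awesome link</a>"
    "</body></html>"
)
body = doc.xpath("/html/body")[0]

# Absolute path: spelled out from the document root, so the starting
# element does not matter.
absolute_from_root = doc.xpath("/html/body/a")
absolute_from_body = body.xpath("/html/body/a")

# Relative path: evaluated from the element xpath() is called on.
relative_from_root = doc.xpath("./body/a")  # <html> has a <body> child
relative_from_body = body.xpath("./a")      # <body> has an <a> child
mismatch = body.xpath("./body/a")           # <body> has no <body> child

print(len(absolute_from_root), len(absolute_from_body))
print(len(relative_from_root), len(relative_from_body), len(mismatch))
```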

In [30]:
tree.xpath('/a')

[]

In [31]:
tree.xpath('.//a')

[<Element a at 0x106a898b8>]

In [32]:
tree.xpath('.//a/..')

[<Element body at 0x106a89318>]

In [33]:
tree.xpath('//@id')

['unordered_list']
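
Attributes selected with `@` are returned as plain strings. When you already hold an element, lxml's `get()` method and `attrib` mapping read the same values. A small sketch, assuming a trimmed copy of the page:

```python
from lxml import html

# Trimmed copy of the example page (assumption: only <a> and <ul> are kept).
doc = html.fromstring(
    "<html><body>"
    "<a href='https://www.georgetown.edu/'>This is an awesome link</a>"
    "<ul id='unordered_list'><li>Jason Schloetzer</li></ul>"
    "</body></html>"
)

# XPath route: select the attribute values directly.
ids = doc.xpath("//@id")
hrefs = doc.xpath("//a/@href")

# Element route: read attributes from an element you already selected.
link = doc.xpath("//a")[0]
href = link.get("href")          # value, or None if the attribute is missing
missing = link.get("title")      # this <a> has no title attribute
all_attributes = dict(link.attrib)

print(ids, hrefs)
print(href, missing, all_attributes)
```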

## Predicates

**`Predicates`** are written in square brackets after a step. They filter the nodes that step selects, for example by position or by attribute value.

Path Expression|Result
--- | ---
`tagname[n]`|Selects the `n`-th `tagname` element among the children of its parent (counting starts at 1)
`tagname[last()]`|Selects the last `tagname` element
`tagname[last()-1]`|Selects the second-to-last `tagname` element
`tagname[position()<3]`|Selects the first two `tagname` elements
`tagname[@attribute_name]`|Selects all the `tagname` elements that have an attribute named `attribute_name`
`tagname[@attribute_name='attribute_value']`|Selects all the `tagname` elements that have an `attribute_name` attribute with a value of "attribute_value"
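
The positional predicates in the table can be tried on the example list. The sketch below assumes a copy of the `ul` from the page above and appends `/text()` to each expression only to make the results easy to read.

```python
from lxml import html

# Copy of the list from the example page (assumption: rest of page omitted).
doc = html.fromstring(
    "<html><body><ul id='unordered_list'>"
    "<li>Jason Schloetzer</li><li>Bill Smith</li><li>Barney Stinson</li>"
    "</ul></body></html>"
)

second = doc.xpath("//li[2]/text()")                   # positions start at 1
second_to_last = doc.xpath("//li[last()-1]/text()")
first_two = doc.xpath("//li[position()<3]/text()")

print(second)
print(second_to_last)
print(first_two)
```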

In [23]:
tree.xpath('//li[1]')

[<Element li at 0x106acbcc8>]

In [24]:
tree.xpath('//li[last()]')

[<Element li at 0x106b0aae8>]

In [28]:
tree.xpath("//ul[@id]")

[<Element ul at 0x106a89908>]

In [29]:
tree.xpath("//ul[@id='unordered_list']")

[<Element ul at 0x106a89908>]
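
One subtlety is easy to miss when the page has only one list: a positional predicate counts within each parent, not across the whole document. The sketch below uses a hypothetical page with two lists (not part of the example above) to show the difference between `//li[1]` and `(//li)[1]`.

```python
from lxml import html

# Hypothetical two-list page, invented for this illustration.
doc = html.fromstring(
    "<html><body>"
    "<ul id='first'><li>A1</li><li>A2</li></ul>"
    "<ul id='second'><li>B1</li><li>B2</li></ul>"
    "</body></html>"
)

# //li[1]: every <li> that is the first <li> child of its parent.
first_in_each_list = doc.xpath("//li[1]/text()")

# (//li)[1]: collect all <li> elements first, then take the first one.
first_overall = doc.xpath("(//li)[1]/text()")

# Attribute predicates can narrow the search to one list.
second_list_items = doc.xpath("//ul[@id='second']/li/text()")

print(first_in_each_list)
print(first_overall)
print(second_list_items)
```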

## Wildcards

Wildcard|Description
--- | ---
`*`|Matches any element node
`node()`|Matches any node of any kind

In [40]:
tree.xpath('*')

[<Element head at 0x106a75ea8>, <Element body at 0x106a89318>]

In [41]:
tree.xpath('//*')

[<Element html at 0x106a75db8>,
 <Element head at 0x106a75ea8>,
 <Element title at 0x106b6ed18>,
 <Element body at 0x106a89318>,
 <Element h1 at 0x106a895e8>,
 <Element p at 0x106a89868>,
 <Element a at 0x106a898b8>,
 <Element ul at 0x106a89908>,
 <Element li at 0x106acbcc8>,
 <Element li at 0x106acbef8>,
 <Element li at 0x106b0aae8>]

In [50]:
tree.xpath('//a')[0].xpath('//*')

[<Element html at 0x106a75db8>,
 <Element head at 0x106a75ea8>,
 <Element title at 0x106b6ed18>,
 <Element body at 0x106a89318>,
 <Element h1 at 0x106a895e8>,
 <Element p at 0x106a89868>,
 <Element a at 0x106a898b8>,
 <Element ul at 0x106a89908>,
 <Element li at 0x106acbcc8>,
 <Element li at 0x106acbef8>,
 <Element li at 0x106b0aae8>]
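
The result above is the whole document even though `xpath()` was called on the `a` element, because a path that starts with `//` is absolute. To search only below a given element, start the path with `.//`. A minimal sketch, assuming a shortened copy of the page:

```python
from lxml import html

# Shortened copy of the example page (assumption: h1 and p omitted).
doc = html.fromstring(
    "<html><head><title>Page Title</title></head>"
    "<body><a href='https://www.georgetown.edu/'>This is an awesome link</a>"
    "<ul id='unordered_list'>"
    "<li>Jason Schloetzer</li><li>Bill Smith</li><li>Barney Stinson</li>"
    "</ul></body></html>"
)
ul = doc.xpath("//ul")[0]

# '//*' ignores the starting element and scans the whole document.
whole_document = [el.tag for el in ul.xpath("//*")]

# './/*' only returns descendants of the starting element.
inside_ul = [el.tag for el in ul.xpath(".//*")]

print(whole_document)
print(inside_ul)
```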

In [43]:
tree.xpath('//node()')

[<Element html at 0x106a75db8>,
 '\n    ',
 <Element head at 0x106a75ea8>,
 '\n        ',
 <Element title at 0x106b6ed18>,
 'Page Title',
 '\n    ',
 '\n    ',
 <Element body at 0x106a89318>,
 '\n        ',
 <Element h1 at 0x106a895e8>,
 'This is an amazing heading',
 '\n        ',
 <Element p at 0x106a89868>,
 'This is a fantastic paragraph.',
 '\n        ',
 <Element a at 0x106a898b8>,
 'This is an awesome link',
 '\n        ',
 <Element ul at 0x106a89908>,
 '\n            ',
 <Element li at 0x106acbcc8>,
 'Jason Schloetzer',
 '\n            ',
 <Element li at 0x106acbef8>,
 'Bill Smith',
 '\n            ',
 <Element li at 0x106b0aae8>,
 'Barney Stinson',
 '\n        ',
 '\n    ',
 '\n']
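
The difference between `*` and `node()` is easier to see on a small fragment. The sketch below uses a hypothetical paragraph containing text, a comment, and one child element; `comment()` is the standard XPath node test for comments.

```python
from lxml import html

# Hypothetical paragraph mixing text, a comment, and an element.
doc = html.fromstring(
    "<html><body><p>Hi<!-- note --><b>x</b></p></body></html>"
)

# '*' matches element nodes only.
elements_only = doc.xpath("//p/*")

# 'node()' matches every child node: text, comment, and element.
all_nodes = doc.xpath("//p/node()")

# 'comment()' matches comment nodes only.
comments = doc.xpath("//p/comment()")

print([el.tag for el in elements_only])
print(len(all_nodes))
print(comments[0].text)
```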

## Summary

Expression|Description
--- | ---
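`tagname`|Selects all child elements of the current node named "tagname"
`/`|Selects from the root node when it starts a path; between steps it separates a parent from its child
`//`|Selects matching nodes at any depth: anywhere in the document when the path starts with `//`, or anywhere below the current node when written as `.//`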
`.`|Selects the current node
`..`|Selects the parent of the current node
`@`|Selects attributes
`tagname[n]`|Selects the `n`-th `tagname` element among the children of its parent (counting starts at 1)
`tagname[last()]`|Selects the last `tagname` element
`tagname[last()-1]`|Selects the second-to-last `tagname` element
`tagname[position()<3]`|Selects the first two `tagname` elements
`tagname[@attribute_name]`|Selects all the `tagname` elements that have an attribute named `attribute_name`
`tagname[@attribute_name='attribute_value']`|Selects all the `tagname` elements that have an `attribute_name` attribute with a value of "attribute_value"
`*`|Matches any element node
`node()`|Matches any node of any kind

## Exercises

In [65]:
# Exercise 1: select any node that has an attribute

In [66]:
# Exercise 2: select any node that doesn't have text

In [67]:
# Exercise 3: get the text of the whole HTML page

In [70]:
# Exercise 4: get the title of the page

In [69]:
# Exercise 5: build a dictionary of all links in the page, in the form {text: link}