## HyperText Markup Language (HTML) and XPath

### HTML

HTML is the standard markup language for creating Web pages.

- Its elements are the building blocks of HTML pages and are represented by tags
- HTML tags label pieces of content such as "heading", "paragraph", "table", and so on
- There are more than 100 different tags
- Browsers do not display the HTML tags, but use them to render the content of the page

Example:

```html
<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is an amazing heading</h1>
        <p>This is a fantastic paragraph.</p>
        <a href="https://www.georgetown.edu/">This is an awesome link</a>
        <ul id="unordered_list">
            <li>Jason Schloetzer</li>
            <li>Bill Smith</li>
            <li>Barney Stinson</li>
        </ul>
    </body>
</html>
```

Save it in a plain-text file and open it in a browser. 
You can use either `.htm` or `.html` as the file extension.

- `<!DOCTYPE html>`: declaration that defines this document to be HTML (in particular, HTML5). It helps browsers to display web pages correctly. It must only appear once, at the top of the page.
- `<html>`: root element of an HTML page
- `<head>`: element that contains metadata (data about data) for the document, not displayed.
- `<title>`: element that specifies a title for the document
- `<body>`: element that contains the **visible content** 
- `<h1>`: element that defines a large heading. `<h1>` defines the most important heading, `<h6>` defines the least important heading.
- `<p>`: element that defines a paragraph. Extra spaces and line breaks in the source are collapsed when the page is rendered.
- `<a>`: element that defines a link ("anchor"). The link's destination is specified in the `href` attribute. 
- `<ul>`: element that defines an unordered list (`<ol>` for ordered lists), whose items are `<li>` elements. Any element can have an `id` attribute.

As you can see, HTML elements can be nested: `<li>` is inside `<ul>` which is inside `<body>` which is inside `<html>`. 
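
The nesting can also be made visible programmatically. The following is a minimal sketch using Python's standard-library `html.parser`; the class name `OutlineRecorder` and the shortened `PAGE` string are illustrative assumptions, not part of the original example.

```python
from html.parser import HTMLParser

# A shortened copy of the example page (assumption: trimmed for brevity).
PAGE = """<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is an amazing heading</h1>
        <ul id="unordered_list">
            <li>Jason Schloetzer</li>
            <li>Bill Smith</li>
        </ul>
    </body>
</html>"""


class OutlineRecorder(HTMLParser):
    """Record each start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # everything until the end tag is one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1


recorder = OutlineRecorder()
recorder.feed(PAGE)

# Print the tags indented by depth, which shows the nesting.
for depth, tag in recorder.outline:
    print("    " * depth + tag)
```

This simple depth counter assumes every element in the page has an explicit end tag, which holds for the example above.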

**HTML is a tree**

HTML tags are element names surrounded by angle brackets:

`<tagname>content</tagname>`

- HTML tags come in pairs like `<p>` and `</p>`
- The first tag in a pair is the start tag, or opening tag
- The second tag is the end tag, or closing tag
- The end tag is written like the start tag, but with a forward slash inserted before the tag name
- The browser does not display the HTML tags, but uses them to determine how to display the document.
- Attributes in the start tag are used to provide additional information about HTML elements.
- Empty elements don't have an end tag (e.g. `<br>`, line break)
- Some HTML elements display correctly even without the end tag because browsers are error-tolerant, but omitting end tags makes pages harder to parse.
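
To see start tags, end tags, attributes, and empty elements as a parser sees them, here is a small sketch based on the standard-library `html.parser`. The `SNIPPET` string and the `EventLogger` class are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical snippet: a paragraph with a line break and a link.
SNIPPET = (
    '<p>First line<br>Second line with a '
    '<a href="https://www.georgetown.edu/">link</a></p>'
)


class EventLogger(HTMLParser):
    """Log every start tag (with its attributes) and every end tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the start tag
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed(SNIPPET)
for event in logger.events:
    print(event)
```

The log contains a start event for `br` but no matching end event, and the `href` attribute arrives together with the start tag of `a`.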

### HTML tables


- An HTML table is defined with the `<table>` tag.
- Each table row is defined with the `<tr>` tag. 
- A table header is defined with the `<th>` tag. 
- A table data/cell is defined with the `<td>` tag.

```html
<table style="width:20%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>
```

- To make a cell span more than one column, use the `colspan` attribute
- To make a cell span more than one row, use the `rowspan` attribute

```html
<table style="width:20%">
  <tr>
    <th>Name</th>
    <th colspan="2">Telephone</th>
  </tr>
  <tr>
    <td>Bill Gates</td>
    <td>55577854</td>
    <td>55577855</td>
  </tr>
</table>
```
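
Because rows and cells are marked up so regularly, a table is easy to turn into data. The sketch below reads the first table (names and ages) with the standard-library `html.parser`. The `TableReader` class is a hypothetical helper, and it deliberately ignores `colspan` and `rowspan`, so it only suits simple grids.

```python
from html.parser import HTMLParser

TABLE = """<table>
  <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
  <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>"""


class TableReader(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])         # a new row starts
        elif tag in ("th", "td"):
            self.in_cell = True
            self.rows[-1].append("")     # a new, still empty cell

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:                 # ignore whitespace between tags
            self.rows[-1][-1] += data.strip()


reader = TableReader()
reader.feed(TABLE)

# Use the header row as dictionary keys for the remaining rows.
header, *records = reader.rows
people = [dict(zip(header, record)) for record in records]
print(people)
```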

### HTML Block, Inline, Entities

- A block-level element always starts on a new line and takes up the full width available
    - e.g. `<div>`, a generic container for other elements
- An inline element does not start on a new line, and only takes up as much width as necessary
    - e.g. `<span>`, a generic container for text

- Reserved characters in HTML must be replaced with **character entities**
    - e.g. if you use `<` or `>` signs in your text, the browser might mistake them for tags

Result|Description|Entity Name|Entity Number
:---: | --- | --- | ---
` `|non-breaking space|`&nbsp;`|`&#160;`
`<`|less than|`&lt;`|`&#60;`
`>`|greater than|`&gt;`|`&#62;`
`&`|ampersand|`&amp;`|`&#38;`
`"`|double quotation mark|`&quot;`|`&#34;`
`'`|single quotation mark|`&apos;`|`&#39;`
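
In Python, the standard-library `html` module converts between reserved characters and entities, so the replacements rarely need to be typed by hand. A short sketch (the sample strings are invented):

```python
import html

# Text that contains reserved characters.
raw = 'if a < b && b > c: print("ok")'

# escape() replaces &, <, > and (by default) quotation marks with entities.
escaped = html.escape(raw)
print(escaped)

# unescape() goes the other way and also understands named entities
# such as &nbsp; (which becomes the character U+00A0).
restored = html.unescape(escaped)
spaced = html.unescape("Tom&nbsp;&amp;&nbsp;Jerry")
print(restored)
print(repr(spaced))
```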

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="../projects/playground.ipynb" style="text-decoration: none"> 
    <h3 style="font-family: monospace">Exercise html.1</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Create a webpage that looks like this:</p></a></font>
</div>

### XPath

XPath uses "path-like" syntax to identify and navigate nodes in an XML document. These path expressions look very much like the path expressions you use with traditional computer file systems.

HTML pages are treated as **trees of nodes**. The topmost element of the tree is called the `root` element.

In [1]:
# In Python, XPath support comes from the third-party "lxml" package.
from lxml import html

In [1]:
html_page = """
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is an amazing heading</h1>
        <p>This is a fantastic paragraph.</p>
        <a href="https://www.georgetown.edu/">This is an awesome link</a>
        <ul id="unordered_list">
            <li>Jason Schloetzer</li>
            <li>Bill Smith</li>
            <li>Barney Stinson</li>
        </ul>
    </body>
</html>
"""

Any HTML document is equivalent to a tree:
![html_tree](../images/html_tree.png)

In [3]:
tree = html.fromstring(html_page)

In [4]:
tree

<Element html at 0x106a75db8>

- Each element has exactly one parent, except the root element, which has none.
- Elements may have 0 or any number of children.
- **`Siblings`** = nodes with the same parent
- **`Ancestors`** = the node's parent, parent's parent, etc.
- **`Descendants`** = the node's children, children's children, etc.
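
These relationships can be explored directly on `lxml` elements. The sketch below rebuilds a shortened version of the example page (an assumption made to keep the block self-contained) and looks at the family of the second `<li>`.

```python
from lxml import html

page = html.fromstring("""
<html>
    <head><title>Page Title</title></head>
    <body>
        <h1>Heading</h1>
        <ul id="unordered_list">
            <li>Jason Schloetzer</li>
            <li>Bill Smith</li>
            <li>Barney Stinson</li>
        </ul>
    </body>
</html>
""")

second_li = page.xpath('//li')[1]

# Parent: the element that directly contains the node.
parent = second_li.getparent().tag

# Siblings: the neighbours that share the same parent.
previous_sibling = second_li.getprevious().text
next_sibling = second_li.getnext().text

# Ancestors: parent, parent's parent, ... up to the root.
ancestors = [node.tag for node in second_li.iterancestors()]

# Descendants: every element below <body>, in document order.
body = page.xpath('//body')[0]
descendants = [node.tag for node in body.iterdescendants()]

print(parent, previous_sibling, next_sibling)
print(ancestors)
print(descendants)
print(page.getparent())  # the root element has no parent
```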

In XPath, there are 7 kinds of nodes: 
- element
- attribute
- text
- namespace
- processing-instruction
- comment
- document nodes

We'll only focus on the first 3.
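
As a rough illustration of those three kinds, the sketch below queries one element node, one attribute node, and one text node. It assumes a one-link page invented for brevity, and it uses `text()`, a standard XPath node test that this section does not otherwise introduce.

```python
from lxml import html

page = html.fromstring(
    '<html><body>'
    '<a href="https://www.georgetown.edu/">This is an awesome link</a>'
    '</body></html>'
)

elements = page.xpath('//a')          # element nodes: lxml Element objects
attributes = page.xpath('//a/@href')  # attribute nodes: returned as strings
texts = page.xpath('//a/text()')      # text nodes: returned as strings

print(elements[0].tag)
print(attributes)
print(texts)
```

Note that `lxml` returns elements as objects but attributes and text as plain strings (string subclasses), so the type of the result depends on the kind of node selected.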

In [5]:
tree.getchildren()

[<Element head at 0x106a75ea8>, <Element body at 0x106a89318>]

In [6]:
tree.getchildren()[1].getparent()

<Element html at 0x106a75db8>

In [7]:
tree.getchildren()[1].getchildren()

[<Element h1 at 0x106a895e8>,
 <Element p at 0x106a89868>,
 <Element a at 0x106a898b8>,
 <Element ul at 0x106a89908>]

In [8]:
tree.getchildren()[1].getchildren()[2]

<Element a at 0x106a898b8>

In [9]:
tree.getchildren()[1].getchildren()[2].text

'This is an awesome link'

In [10]:
tree.xpath('//a')

[<Element a at 0x106a898b8>]

In [14]:
tree.xpath('//a')[0].text

'This is an awesome link'

In [56]:
tree.xpath('//ul')[0].text

'\n            '

In [57]:
tree.xpath('//ul')[0].itertext()

<lxml.etree.ElementTextIterator at 0x106b218d0>

In [58]:
list(tree.xpath('//ul')[0].itertext())

['\n            ',
 'Jason Schloetzer',
 '\n            ',
 'Bill Smith',
 '\n            ',
 'Barney Stinson',
 '\n        ']

In [63]:
', '.join([i.strip() for i in list(tree.xpath('//ul')[0].itertext()) 
           if i.strip()]).strip()

'Jason Schloetzer, Bill Smith, Barney Stinson'
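
The whitespace-only strings come from the indentation between the tags. One possible way to avoid filtering them out is to ask only for the text nodes inside each `<li>`, as in this sketch. It uses the standard XPath node test `text()` and rebuilds just the list part of the page so that it runs on its own.

```python
from lxml import html

page = html.fromstring("""
<html>
    <body>
        <ul id="unordered_list">
            <li>Jason Schloetzer</li>
            <li>Bill Smith</li>
            <li>Barney Stinson</li>
        </ul>
    </body>
</html>
""")

# Select the text nodes that are direct children of each <li>.
names = page.xpath('//ul/li/text()')
joined = ', '.join(name.strip() for name in names)
print(joined)

# Alternative: text_content() concatenates all text below an element;
# split() + join() then collapses the runs of whitespace.
ul = page.xpath('//ul')[0]
flattened = ' '.join(ul.text_content().split())
print(flattened)
```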

Expression|Description
:---: | ---
`tagname`|Selects all child elements of the current node named "tagname"
`/`|Selects from the root node
`//`|Selects matching nodes anywhere in the document, no matter how deep they are; written as `.//`, it searches only below the current node
`.`|Selects the current node
`..`|Selects the parent of the current node
`@`|Selects attributes

- A path that starts with `/` is always absolute: it is evaluated from the root of the document
- A path that starts with `.` is always relative: it is evaluated from the current node

In [30]:
tree.xpath('/a')

[]

In [31]:
tree.xpath('.//a')

[<Element a at 0x106a898b8>]

In [32]:
tree.xpath('.//a/..')

[<Element body at 0x106a89318>]

In [33]:
tree.xpath('//@id')

['unordered_list']
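
The empty result for `/a` follows from the absolute-path rule: a single `/` looks for `<a>` as the root element, and the root is `<html>`. The sketch below contrasts absolute and relative paths on a shortened copy of the page (an assumption made so that the block runs on its own).

```python
from lxml import html

page = html.fromstring("""
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <a href="https://www.georgetown.edu/">This is an awesome link</a>
    </body>
</html>
""")

body = page.xpath('/html/body')[0]

print(page.xpath('/a'))            # [] because the root element is <html>, not <a>
print(page.xpath('/html/body/a'))  # absolute path: every step names a direct child
print(body.xpath('a'))             # relative path: <a> children of <body>
print(body.xpath('./a'))           # same query with the current node written out
print(page.xpath('a'))             # [] because <a> is not a direct child of <html>
```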

### Predicates

**`Predicates`** are embedded in square brackets. They are used to find a specific node, or a node that contains a specific value.

Path Expression|Result
--- | ---
`tagname[n]`|Selects the `n`-th `tagname` element among its siblings (counting starts at 1)
`tagname[last()]`|Selects the last `tagname` element
`tagname[last()-1]`|Selects the last but one `tagname` element
`tagname[position()<3]`|Selects the first two `tagname` elements
`tagname[@attribute_name]`|Selects all the `tagname` elements that have an attribute named `attribute_name`
`tagname[@attribute_name='attribute_value']`|Selects all the `tagname` elements that have an `attribute_name` attribute with a value of "attribute_value"

In [23]:
tree.xpath('//li[1]')

[<Element li at 0x106acbcc8>]

In [24]:
tree.xpath('//li[last()]')

[<Element li at 0x106b0aae8>]

In [28]:
tree.xpath("//ul[@id]")

[<Element ul at 0x106a89908>]

In [29]:
tree.xpath("//ul[@id='unordered_list']")

[<Element ul at 0x106a89908>]
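
Positions in predicates are counted within each parent, which only becomes visible when a page has more than one list. The sketch below uses a hypothetical page with two lists. The helper `texts` is an illustrative name, not part of `lxml`.

```python
from lxml import html

# Hypothetical page with two lists, so positional predicates can be compared.
page = html.fromstring("""
<html><body>
    <ul id="first"><li>A1</li><li>A2</li><li>A3</li></ul>
    <ul><li>B1</li><li>B2</li></ul>
</body></html>
""")


def texts(expression):
    """Return the text of every element matched by an XPath expression."""
    return [node.text for node in page.xpath(expression)]


print(texts('//li[1]'))              # first <li> of EACH list
print(texts('(//li)[1]'))            # first <li> of the whole document
print(texts('//li[last()]'))         # last <li> of each list
print(texts('//li[position()<3]'))   # first two <li> of each list
print(texts("//ul[@id='first']/li[last()-1]"))  # predicates can be chained along a path
```

Wrapping the path in parentheses, as in `(//li)[1]`, applies the position to the combined result instead of to each parent separately.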

### Wildcards

Wildcard|Description
--- | ---
`*`|Matches any element node
`node()`|Matches any node of any kind

In [40]:
tree.xpath('*')

[<Element head at 0x106a75ea8>, <Element body at 0x106a89318>]

In [41]:
tree.xpath('//*')

[<Element html at 0x106a75db8>,
 <Element head at 0x106a75ea8>,
 <Element title at 0x106b6ed18>,
 <Element body at 0x106a89318>,
 <Element h1 at 0x106a895e8>,
 <Element p at 0x106a89868>,
 <Element a at 0x106a898b8>,
 <Element ul at 0x106a89908>,
 <Element li at 0x106acbcc8>,
 <Element li at 0x106acbef8>,
 <Element li at 0x106b0aae8>]

In [50]:
tree.xpath('//a')[0].xpath('//*')

[<Element html at 0x106a75db8>,
 <Element head at 0x106a75ea8>,
 <Element title at 0x106b6ed18>,
 <Element body at 0x106a89318>,
 <Element h1 at 0x106a895e8>,
 <Element p at 0x106a89868>,
 <Element a at 0x106a898b8>,
 <Element ul at 0x106a89908>,
 <Element li at 0x106acbcc8>,
 <Element li at 0x106acbef8>,
 <Element li at 0x106b0aae8>]
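
The query above returns the whole document even though it was run on the `<a>` element: a path that starts with `//` is absolute, so the element it is called on does not restrict the search. To search only below an element, start the path with `.//`. A minimal sketch on a shortened copy of the page (rebuilt here as an assumption, so the block is self-contained):

```python
from lxml import html

page = html.fromstring("""
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is an amazing heading</h1>
        <ul id="unordered_list">
            <li>Jason Schloetzer</li>
            <li>Bill Smith</li>
            <li>Barney Stinson</li>
        </ul>
    </body>
</html>
""")

ul = page.xpath('//ul')[0]

# '//' ignores the element we start from and scans the whole document.
whole_document = ul.xpath('//*')

# './/' starts at the current element and scans only its descendants.
only_descendants = ul.xpath('.//*')

print([node.tag for node in whole_document])
print([node.tag for node in only_descendants])
```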

In [43]:
tree.xpath('//node()')

[<Element html at 0x106a75db8>,
 '\n    ',
 <Element head at 0x106a75ea8>,
 '\n        ',
 <Element title at 0x106b6ed18>,
 'Page Title',
 '\n    ',
 '\n    ',
 <Element body at 0x106a89318>,
 '\n        ',
 <Element h1 at 0x106a895e8>,
 'This is an amazing heading',
 '\n        ',
 <Element p at 0x106a89868>,
 'This is a fantastic paragraph.',
 '\n        ',
 <Element a at 0x106a898b8>,
 'This is an awesome link',
 '\n        ',
 <Element ul at 0x106a89908>,
 '\n            ',
 <Element li at 0x106acbcc8>,
 'Jason Schloetzer',
 '\n            ',
 <Element li at 0x106acbef8>,
 'Bill Smith',
 '\n            ',
 <Element li at 0x106b0aae8>,
 'Barney Stinson',
 '\n        ',
 '\n    ',
 '\n']

### Summary

Expression|Description
--- | ---
`tagname`|Selects all child elements of the current node named "tagname"
`/`|Selects from the root node
`//`|Selects matching nodes anywhere in the document, no matter how deep they are; written as `.//`, it searches only below the current node
`.`|Selects the current node
`..`|Selects the parent of the current node
`@`|Selects attributes
`tagname[n]`|Selects the `n`-th `tagname` element among its siblings (counting starts at 1)
`tagname[last()]`|Selects the last `tagname` element
`tagname[last()-1]`|Selects the last but one `tagname` element
`tagname[position()<3]`|Selects the first two `tagname` elements
`tagname[@attribute_name]`|Selects all the `tagname` elements that have an attribute named `attribute_name`
`tagname[@attribute_name='attribute_value']`|Selects all the `tagname` elements that have an `attribute_name` attribute with a value of "attribute_value"
`*`|Matches any element node
`node()`|Matches any node of any kind

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="../projects/playground.ipynb" style="text-decoration: none"> 
    <h3 style="font-family: monospace">Exercise html.2</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Given the HTML page above:
    <ol style="margin-left: 200px;
               margin-right: 100px;
               line-height: 1.7em;">
        <li>select all nodes that have an attribute</li>
        <li>select all nodes that don't have text</li>
        <li>get the text of the whole HTML page</li>
        <li>get the title of the page</li>
        <li>build a dictionary of all links in the page {text: link}</li>
    </ol>
    </p></a></font>
</div>