## Hyper Text Markup Language (HTML) and XPath
---

### HTML
---

HTML is the standard markup language for creating Web pages.

- Its elements are the building blocks of HTML pages and are represented by tags
- HTML tags label pieces of content such as "heading", "paragraph", "table", and so on
- There are more than 100 different tags
- Browsers do not display the HTML tags, but use them to render the content of the page

Example:

```html
<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is an amazing heading</h1>
        <p>This is a fantastic paragraph.</p>
        <a href="https://www.georgetown.edu/">This is an awesome link</a>
        <ul id="unordered_list">
            <li>Bill Baber</li>
            <li>David Erkens</li>
            <li>Patricia Fairfield</li>
            <li>Gilles Hilary</li>
            <li>Allison Koester</li>
            <li>Silva Kurtisa</li>
            <li>Reining Petacchi</li>
            <li>Jason Schloetzer</li>
            <li>Vicki Tang</li>
            <li>Xiaoli Tian</li>
        </ul>
    </body>
</html>
```

- `<!DOCTYPE html>`: declaration that defines this document to be HTML (in particular, HTML5). It helps browsers to display web pages correctly. It must only appear once, at the top of the page.
- `<html>`: root element of an HTML page
- `<head>`: element that contains metadata (data about data) for the document, not displayed.
- `<title>`: element that specifies a title for the document
- `<body>`: element that contains the **visible content** 
- `<h1>`: element that defines a large heading. `<h1>` defines the most important heading, `<h6>` defines the least important heading.
- `<p>`: element that defines a paragraph. Extra spaces and new lines are ignored.
- `<a>`: element that defines a link ("anchor"). The link's destination is specified in the `href` attribute. 
- `<ul>`: element that defines an unordered list (`<ol>` for ordered lists), whose atomic elements are `<li>`. Any element can have an `id` attribute.

In [2]:
from IPython.display import display, HTML

s = """
<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is an amazing heading</h1>
        <p>This is a fantastic paragraph.</p>
        <a href="https://www.georgetown.edu/">This is an awesome link</a>
        <ul id="unordered_list">
            <li>Bill Baber</li>
            <li>David Erkens</li>
            <li>Patricia Fairfield</li>
            <li>Gilles Hilary</li>
            <li>Allison Koester</li>
            <li>Silva Kurtisa</li>
            <li>Reining Petacchi</li>
            <li>Jason Schloetzer</li>
            <li>Vicki Tang</li>
            <li>Xiaoli Tian</li>
        </ul>
    </body>
</html>
"""

h = HTML(s)
display(h)

As you can see, HTML elements can be nested: `<li>` is inside `<ul>` which is inside `<body>` which is inside `<html>`. 

**HTML is a tree**

HTML tags are element names surrounded by angle brackets:

`<tagname>content</tagname>`

- HTML tags come in pairs like `<p>` and `</p>`
- The first tag in a pair is the start tag, or opening tag
- The second tag is the end tag, or closing tag
- The end tag is written like the start tag, but with a forward slash inserted before the tag name
- The browser does not display the HTML tags, but uses them to determine how to display the document.
- Attributes in the start tag are used to provide additional information about HTML elements.
- Empty elements don't have an end tag (e.g. `<br>`, line break)
- Some HTML elements display correctly even without the end tag, because browsers are error-tolerant. However, missing end tags make pages harder to parse.
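
The tag rules above can be observed with Python's standard-library `html.parser`. The following is only a minimal sketch: the class name `TagPrinter`, the short list of empty elements in `VOID`, and the sample markup are illustrative assumptions, not part of the course material.

```python
from html.parser import HTMLParser


class TagPrinter(HTMLParser):
    """Record each start/end tag, indented by its nesting depth."""

    # Empty elements have no end tag, so they must not change the depth.
    # (Illustrative subset only, not the full list.)
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the start tag
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.depth -= 1
            self.lines.append("  " * self.depth + f"</{tag}>")


parser = TagPrinter()
parser.feed(
    '<body><p>Hello<br>world</p>'
    '<a href="https://www.georgetown.edu/">link</a></body>'
)
print("\n".join(parser.lines))
```

The printed outline shows `<br>` as a start tag with no matching end tag, and the `href` attribute appearing only in the start tag of `<a>`.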

### HTML tables
---

- An HTML table is defined with the `<table>` tag.
- Each table row is defined with the `<tr>` tag. 
- A table header is defined with the `<th>` tag. 
- A table data/cell is defined with the `<td>` tag.

```html
<table style="width:20%">
  <tr>
    <th>First name</th>
    <th>Last name</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Jason</td>
    <td>Schloetzer</td> 
    <td>28</td>
  </tr>
</table>
```

In [7]:
from IPython.display import display, HTML

s = """
<table style="width:20%">
  <tr>
    <th>First name</th>
    <th>Last name</th> 
    <th>Age</th>
  </tr>
  <tr>
    <td>Jason</td>
    <td>Schloetzer</td> 
    <td>28</td>
  </tr>
</table>
"""

h = HTML(s)
display(h)

First name,Last name,Age
Jason,Schloetzer,28
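
Because rows live in `<tr>` and cells in `<th>`/`<td>`, a table can be read back into plain Python lists. This is a rough sketch using `lxml` (introduced in the XPath section); the variable names `table_html`, `rows`, `header`, and `records` are assumptions made for illustration.

```python
from lxml import html

table_html = """
<table>
  <tr><th>First name</th><th>Last name</th><th>Age</th></tr>
  <tr><td>Jason</td><td>Schloetzer</td><td>28</td></tr>
</table>
"""

table = html.fromstring(table_html)

rows = []
for tr in table.xpath(".//tr"):
    # Header cells (<th>) and data cells (<td>) are both direct children of <tr>
    cells = [cell.text_content().strip() for cell in tr.xpath("./th | ./td")]
    rows.append(cells)

# First row is the header, the remaining rows are records
header, *records = rows
print(header)
print(records)
```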


- To make a cell span more than one column, use the `colspan` attribute
- To make a cell span more than one row, use the `rowspan` attribute

```html
<table style="width:20%">
  <tr>
    <th>Name</th>
    <th colspan="2">Telephone</th>
  </tr>
  <tr>
    <td>Bill Gates</td>
    <td>55577854</td>
    <td>55577855</td>
  </tr>
</table>
```

In [10]:
from IPython.display import display, HTML

s = """
<table style="width:20%">
  <tr>
    <th>Name</th>
    <th colspan="6">Companies</th>
  </tr>
  <tr>
    <td>Elon Musk</td>
    <td>PayPal</td>
    <td>SpaceX</td>
    <td>Tesla</td>
    <td>SolarCity</td>
    <td>Hyperloop</td>
    <td>OpenAI</td>
  </tr>
  <tr>
    <td>Mark Cuban</td>
    <td>Dallas Mavericks</td>
    <td>2929 Entertainment</td>
    <td>AXS TV</td>
    <td>Magnolia Pictures</td>
    <td>Billshark</td>
    <td>Alyssa's</td>
  </tr>
</table>
"""

h = HTML(s)
display(h)

Name,Companies,Companies.1,Companies.2,Companies.3,Companies.4,Companies.5
Elon Musk,PayPal,SpaceX,Tesla,SolarCity,Hyperloop,OpenAI
Mark Cuban,Dallas Mavericks,2929 Entertainment,AXS TV,Magnolia Pictures,Billshark,Alyssa's


### HTML Entities
---

- Reserved characters in HTML must be replaced with character entities
- e.g. if you use `<` or `>` in your text, the browser might mistake them for tags

Result|Description|Entity Name|Entity Number
:---: | --- | --- | ---
` `|non-breaking space|`&nbsp;`|`&#160;`
`<`|less than|`&lt;`|`&#60;`
`>`|greater than|`&gt;`|`&#62;`
`&`|ampersand|`&amp;`|`&#38;`
`"`|double quotation mark|`&quot;`|`&#34;`
`'`|single quotation mark|`&apos;`|`&#39;`
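
In Python, the standard-library `html` module can convert between reserved characters and entities. A small sketch follows; the sample string `raw` is made up for illustration. The functions are imported by name so that the name `html` stays free for `lxml.html` later on.

```python
from html import escape, unescape

raw = 'if a < b & b > c: print("ok")'

# escape() replaces &, <, > and (by default) quotation marks with entities
safe = escape(raw)
print(safe)

# unescape() converts named and numeric entities back to characters
print(unescape(safe) == raw)
print(repr(unescape("&lt;p&gt; &#60;p&#62; &amp; &apos;")))

# The non-breaking space is the character U+00A0, not a regular space
print(unescape("&nbsp;") == "\xa0")
```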

### XPath
---

- Should we use regular expressions to parse `HTML`? No: it is better to use a proper `HTML` parser such as `lxml.html`.

- The parser turns the web page into a tree-like object that can be queried with `XPath`.

- `XPath` uses a path-like syntax (similar to file system paths) to identify and navigate nodes in an XML document.

- `HTML` pages are treated as **trees of nodes**. The topmost element of the tree is called the `root` element.

In [14]:
# In Python, XPath is supported by the "lxml" package.
from lxml import html

In [15]:
html_page = """
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>This is an amazing heading</h1>
        <p>This is a fantastic paragraph.</p>
        <a href="https://www.georgetown.edu/">This is an awesome link</a>
        <ul id="unordered_list">
            <li>Bill Baber</li>
            <li>David Erkens</li>
            <li>Patricia Fairfield</li>
            <li>Gilles Hilary</li>
            <li>Allison Koester</li>
            <li>Silva Kurtisa</li>
            <li>Reining Petacchi</li>
            <li>Jason Schloetzer</li>
            <li>Vicki Tang</li>
            <li>Xiaoli Tian</li>
        </ul>
    </body>
</html>
"""

Any HTML document is equivalent to a tree:
![html_tree](../images/html_tree.png)

In [16]:
tree = html.fromstring(html_page)

In [17]:
tree

<Element html at 0x10dd32b38>

- Each element has exactly one parent, except the root element, which has none.
- Elements may have 0 or any number of children.
- **`Siblings`** = nodes with the same parent
- **`Ancestors`** = the node's parent, parent's parent, etc.
- **`Descendants`** = the node's children, children's children, etc.
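
These family relations map onto methods of `lxml` elements. The snippet below is a sketch on a small made-up document (`doc`, `second`, and the other variable names are illustrative assumptions).

```python
from lxml import html

doc = html.fromstring(
    "<html><body><h1>Title</h1>"
    "<ul><li>A</li><li>B</li><li>C</li></ul>"
    "</body></html>"
)
ul = doc.xpath("//ul")[0]
second = doc.xpath("//li")[1]  # the <li> containing "B"

# Parent: the element directly above
parent = second.getparent().tag

# Ancestors: parent, parent's parent, ... (nearest first)
ancestors = [e.tag for e in second.iterancestors()]

# Siblings: other children of the same parent
before = [e.text for e in second.itersiblings(preceding=True)]
after = [e.text for e in second.itersiblings()]

# Descendants: children, children's children, ...
descendants = [e.tag for e in ul.iterdescendants()]

print(parent, ancestors, before, after, descendants)
```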

In XPath, there are 7 kinds of nodes: 
- element
- attribute
- text
- namespace
- processing-instruction
- comment
- document nodes

We'll only focus on the first 3.
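
As a quick illustration of these three kinds of nodes, the sketch below queries one element node, one attribute node, and one text node from a one-link document (the variable names are assumptions for illustration).

```python
from lxml import html

doc = html.fromstring(
    '<html><body><a href="https://www.georgetown.edu/">'
    "This is an awesome link</a></body></html>"
)

elements = doc.xpath("//a")          # element nodes  -> lxml Element objects
attributes = doc.xpath("//a/@href")  # attribute nodes -> strings
texts = doc.xpath("//a/text()")      # text nodes      -> strings

print(elements[0].tag)
print(attributes)
print(texts)
```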

In [21]:
# Note: getchildren() is deprecated in lxml; list(tree) returns the same children.
tree.getchildren()

[<Element head at 0x10e3e2778>, <Element body at 0x10e3e27c8>]

In [22]:
tree.getchildren()[1].getparent()

<Element html at 0x10dd32b38>

In [23]:
tree.getchildren()[1].getchildren()

[<Element h1 at 0x10e3e29a8>,
 <Element p at 0x10e3e29f8>,
 <Element a at 0x10e3e2a48>,
 <Element ul at 0x10e3e2a98>]

In [24]:
tree.getchildren()[1].getchildren()[2]

<Element a at 0x10e3e2a48>

In [25]:
tree.getchildren()[1].getchildren()[2].text

'This is an awesome link'

In [26]:
tree.xpath('//a')

[<Element a at 0x10e3e2a48>]

In [27]:
tree.xpath('//a')[0].text

'This is an awesome link'

In [28]:
tree.xpath('//ul')[0].text

'\n            '

In [29]:
tree.xpath('//ul')[0].itertext()

<lxml.etree.ElementTextIterator at 0x10e3d4f98>

In [30]:
list(tree.xpath('//ul')[0].itertext())

['\n            ',
 'Bill Baber',
 '\n            ',
 'David Erkens',
 '\n            ',
 'Patricia Fairfield',
 '\n            ',
 'Gilles Hilary',
 '\n            ',
 'Allison Koester',
 '\n            ',
 'Silva Kurtisa',
 '\n            ',
 'Reining Petacchi',
 '\n            ',
 'Jason Schloetzer',
 '\n            ',
 'Vicki Tang',
 '\n            ',
 'Xiaoli Tian',
 '\n        ']

In [31]:
', '.join([i.strip() for i in list(tree.xpath('//ul')[0].itertext()) 
           if i.strip()]).strip()

'Bill Baber, David Erkens, Patricia Fairfield, Gilles Hilary, Allison Koester, Silva Kurtisa, Reining Petacchi, Jason Schloetzer, Vicki Tang, Xiaoli Tian'
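
A possible alternative to filtering whitespace out of `itertext()` is to ask XPath for the text nodes of the `<li>` elements directly. This is a sketch on a shortened, made-up list (`doc`, `names`, and `joined` are illustrative names).

```python
from lxml import html

doc = html.fromstring(
    "<html><body><ul>\n"
    "  <li>Bill Baber</li>\n"
    "  <li>David Erkens</li>\n"
    "</ul></body></html>"
)

# text() under each <li> skips the whitespace-only text nodes that sit
# directly inside <ul> between the list items
names = doc.xpath("//ul/li/text()")
joined = ", ".join(names)
print(joined)
```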

Expression|Description
:---: | ---
`tagname`|Selects all nodes with the name "tagname"
`/`|Selects from the root node
`//`|Selects matching nodes anywhere in the document, no matter where they are (use `.//` to search only below the current node)
`.`|Selects the current node
`..`|Selects the parent of the current node
`@`|Selects attributes

- If a path starts with `/`, it is an absolute path (evaluated from the root of the document)
- If a path starts with `.`, it is a relative path (evaluated from the current node)

In [32]:
tree.xpath('/a')

[]

In [33]:
tree.xpath('.//a')

[<Element a at 0x10e3e2a48>]

In [36]:
tree.xpath('.//a/..')

[<Element body at 0x10e3e27c8>]

In [37]:
tree.xpath('//@id')

['unordered_list']
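
The difference between absolute and relative paths matters most when a query is run on a sub-element. A small sketch, using a made-up document and illustrative variable names:

```python
from lxml import html

doc = html.fromstring(
    "<html><head><title>T</title></head>"
    "<body><ul><li>A</li><li>B</li></ul></body></html>"
)
head = doc.xpath("//head")[0]

# '//' starts at the document root, even when called on <head>
from_root = head.xpath("//li")

# './/' only looks below the current node, and <head> contains no <li>
below_head = head.xpath(".//li")

# A fully spelled-out absolute path
title = doc.xpath("/html/head/title/text()")

print(len(from_root), below_head, title)
```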

### Predicates
---

**`Predicates`** are embedded in square brackets. They are used to find a specific node, or a node that contains a specific value.

Path Expression|Result
--- | ---
`tagname[n]`|Selects the `n`-th `tagname` element among its siblings (counting starts at 1)
`tagname[last()]`|Selects the last `tagname` element
`tagname[last()-1]`|Selects the last but one `tagname` element
`tagname[position()<3]`|Selects the first two `tagname` elements
`tagname[@attribute_name]`|Selects all the `tagname` elements that have an attribute named `attribute_name`
`tagname[@attribute_name='attribute_value']`|Selects all the `tagname` elements that have an `attribute_name` attribute with a value of "attribute_value"

In [38]:
tree.xpath('//li[1]')

[<Element li at 0x10e3eaf48>]

In [39]:
tree.xpath('//li[last()]')

[<Element li at 0x10e3f04a8>]

In [40]:
tree.xpath("//ul[@id]")

[<Element ul at 0x10e3e2a98>]

In [41]:
tree.xpath("//ul[@id='unordered_list']")

[<Element ul at 0x10e3e2a98>]
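
Positional predicates are evaluated per parent, which becomes visible as soon as a page has more than one list. The sketch below uses a made-up document with two lists (all names are illustrative assumptions).

```python
from lxml import html

doc = html.fromstring(
    "<html><body>"
    '<ul id="first"><li>A</li><li>B</li><li>C</li></ul>'
    "<ul><li>D</li><li>E</li></ul>"
    "</body></html>"
)

# [1] applies within each parent: the first <li> of every list
first_in_each = doc.xpath("//li[1]/text()")

# Parentheses make [1] apply to the whole result: the first <li> overall
first_overall = doc.xpath("(//li)[1]/text()")

# Combining an attribute predicate with positional predicates
first_two = doc.xpath("//ul[@id='first']/li[position()<3]/text()")
last_but_one = doc.xpath("//ul[@id='first']/li[last()-1]/text()")

print(first_in_each, first_overall, first_two, last_but_one)
```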

### Wildcards
---

Wildcard|Description
--- | ---
`*`|Matches any element node
`node()`|Matches any node of any kind
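
To make the difference between the two wildcards concrete, here is a sketch on a one-paragraph, made-up document: `*` returns only the child element, while `node()` also returns the surrounding text nodes.

```python
from lxml import html

doc = html.fromstring(
    "<html><body><p>one <b>two</b> three</p></body></html>"
)

# '*' matches element nodes only
elements_only = [e.tag for e in doc.xpath("//p/*")]

# 'node()' matches every child node of <p>: text, element, text
all_nodes = doc.xpath("//p/node()")

print(elements_only)
print(all_nodes)
```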

In [42]:
tree.xpath('*')

[<Element head at 0x10e3e2778>, <Element body at 0x10e3e27c8>]

In [43]:
tree.xpath('//*')

[<Element html at 0x10dd32b38>,
 <Element head at 0x10e3e2778>,
 <Element title at 0x10e3f06d8>,
 <Element body at 0x10e3e27c8>,
 <Element h1 at 0x10e3e29a8>,
 <Element p at 0x10e3e29f8>,
 <Element a at 0x10e3e2a48>,
 <Element ul at 0x10e3e2a98>,
 <Element li at 0x10e3eaf48>,
 <Element li at 0x10e3f0958>,
 <Element li at 0x10e3f09a8>,
 <Element li at 0x10e3f09f8>,
 <Element li at 0x10e3f0a48>,
 <Element li at 0x10e3f0a98>,
 <Element li at 0x10e3f0ae8>,
 <Element li at 0x10e3f0b38>,
 <Element li at 0x10e3f0b88>,
 <Element li at 0x10e3f04a8>]

In [44]:
tree.xpath('//a')[0].xpath('//*')

[<Element html at 0x10dd32b38>,
 <Element head at 0x10e3e2778>,
 <Element title at 0x10e3f06d8>,
 <Element body at 0x10e3e27c8>,
 <Element h1 at 0x10e3e29a8>,
 <Element p at 0x10e3e29f8>,
 <Element a at 0x10e3e2a48>,
 <Element ul at 0x10e3e2a98>,
 <Element li at 0x10e3eaf48>,
 <Element li at 0x10e3f0958>,
 <Element li at 0x10e3f09a8>,
 <Element li at 0x10e3f09f8>,
 <Element li at 0x10e3f0a48>,
 <Element li at 0x10e3f0a98>,
 <Element li at 0x10e3f0ae8>,
 <Element li at 0x10e3f0b38>,
 <Element li at 0x10e3f0b88>,
 <Element li at 0x10e3f04a8>]

In [45]:
tree.xpath('//node()')

[<Element html at 0x10dd32b38>,
 '\n    ',
 <Element head at 0x10e3e2778>,
 '\n        ',
 <Element title at 0x10e3f06d8>,
 'Page Title',
 '\n    ',
 '\n    ',
 <Element body at 0x10e3e27c8>,
 '\n        ',
 <Element h1 at 0x10e3e29a8>,
 'This is an amazing heading',
 '\n        ',
 <Element p at 0x10e3e29f8>,
 'This is a fantastic paragraph.',
 '\n        ',
 <Element a at 0x10e3e2a48>,
 'This is an awesome link',
 '\n        ',
 <Element ul at 0x10e3e2a98>,
 '\n            ',
 <Element li at 0x10e3eaf48>,
 'Bill Baber',
 '\n            ',
 <Element li at 0x10e3f0958>,
 'David Erkens',
 '\n            ',
 <Element li at 0x10e3f09a8>,
 'Patricia Fairfield',
 '\n            ',
 <Element li at 0x10e3f09f8>,
 'Gilles Hilary',
 '\n            ',
 <Element li at 0x10e3f0a48>,
 'Allison Koester',
 '\n            ',
 <Element li at 0x10e3f0a98>,
 'Silva Kurtisa',
 '\n            ',
 <Element li at 0x10e3f0ae8>,
 'Reining Petacchi',
 '\n            ',
 <Element li at 0x10e3f0b38>,
 'Jason Schl

### Summary
---

Expression|Description
--- | ---
`tagname`|Selects all nodes with the name "tagname"
`/`|Selects from the root node
`//`|Selects matching nodes anywhere in the document, no matter where they are (use `.//` to search only below the current node)
`.`|Selects the current node
`..`|Selects the parent of the current node
`@`|Selects attributes
`tagname[n]`|Selects the `n`-th `tagname` element among its siblings (counting starts at 1)
`tagname[last()]`|Selects the last `tagname` element
`tagname[last()-1]`|Selects the last but one `tagname` element
`tagname[position()<3]`|Selects the first two `tagname` elements
`tagname[@attribute_name]`|Selects all the `tagname` elements that have an attribute named `attribute_name`
`tagname[@attribute_name='attribute_value']`|Selects all the `tagname` elements that have an `attribute_name` attribute with a value of "attribute_value"
`*`|Matches any element node
`node()`|Matches any node of any kind

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="../projects/playground.ipynb" style="text-decoration: none"> 
    <h3 style="font-family: monospace">Exercise html.2</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Given the HTML page above:
    <ol style="margin-left: 200px;
               margin-right: 100px;
               line-height: 1.7em;">
        <li>select all nodes that have an attribute</li>
        <li>select all nodes that don't have text</li>
        <li>get the text of the whole HTML page</li>
        <li>get the title of the page</li>
        <li>build a dictionary of all links in the page {text: link}</li>
    </ol>
    </p></a></font>
</div>