### HTML

#### Terminology

The **XHTML** format is a specific and strict version of **HTML**, created at a time when there were initiatives to bring the *HTML* format (which is not strict) closer to the **XML** format.

More generally, **HTML** can be treated as a close relative of **XML**: both describe a document as a tree of nested, tagged elements.

When these files are malformed in the **XML** sense, there are heuristics in place to still understand them. Overall, these heuristics are standardized so that each browser interprets them the same way.

The Python module Beautiful Soup (`bs4`) can apply such heuristics and read **HTML** content even when it is malformed.
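The contrast between strict XML parsing and lenient HTML parsing can be sketched with the standard library alone. The snippet and the names `MALFORMED`, `parses_as_xml` and `TagCollector` are hypothetical, chosen only for illustration. Note that `html.parser` merely tokenizes the input without raising an error; it does not rebuild a corrected tree the way a browser would.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: <br> is never closed and </p> is missing,
# so this is not well-formed XML.
MALFORMED = "<html><body><p>Hello<br>world</body></html>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record every start tag reported by the lenient HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(MALFORMED)
collector.close()

xml_ok = parses_as_xml(MALFORMED)
print("Accepted as XML:", xml_ok)
print("Tags seen by html.parser:", collector.tags)
```

The strict parser stops at the first mismatched tag, whereas the HTML tokenizer keeps going and reports every tag it encounters.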

#### Loading an HTML Document

In [None]:
# Run this cell to install beautifulsoup4
import subprocess
print(subprocess.getstatusoutput("pip install beautifulsoup4"))

In [None]:
with open('document.html') as f:
    print(f.read())

In [None]:
from bs4 import BeautifulSoup

In [None]:
with open('document.html') as f:
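    # Name the parser explicitly to avoid a warning and get the same result on every machine
    soup = BeautifulSoup(f.read(), 'html.parser')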

In [None]:
for line in soup.prettify().splitlines():
    print(line)

In [None]:
soup.html.body.p.text

In [None]:
# find_next_sibling() is the current name of the deprecated findNextSibling()
soup.html.body.p.find_next_sibling().attrs

#### Parsing in the Style of SAX

In [None]:
# The handle_* callbacks used below belong to the standard library parser, not to lxml
from html.parser import HTMLParser

In [None]:
with open('document.html') as f:
    parser = HTMLParser()
    parser.feed(f.read())

In [None]:
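class TitleHTMLParser(HTMLParser):
    # True while the parser is between <title> and </title>
    capture = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.capture = True

    def handle_endtag(self, tag):
        # Stop capturing only when the title element itself closes
        if tag == 'title':
            self.capture = False

    def handle_data(self, data):
        if self.capture:
            print(data)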


In [None]:
with open('document.html') as f:
    parser = TitleHTMLParser()
    parser.feed(f.read())
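For readers without `document.html` at hand, here is a minimal self-contained sketch of the same event-driven approach. The `SAMPLE` string is a hypothetical stand-in for the file's contents, and `TitleCollector` is an illustrative variant that stores the text in a list instead of printing it from inside the callback.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the contents of document.html.
SAMPLE = """<html>
  <head><title>My page</title></head>
  <body><p>First paragraph</p><p class="note">Second</p></body>
</html>"""


class TitleCollector(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Called for every run of text; keep only text inside <title>.
        if self.in_title:
            self.titles.append(data)


collector = TitleCollector()
collector.feed(SAMPLE)
collector.close()
print("".join(collector.titles))
```

Storing results on the instance rather than printing them makes the parser easier to reuse and to test. As in SAX, the document is never held as a tree: the parser only reacts to events as they arrive.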

----