# Parsing HTML and XML

In [1]:
import requests

In [2]:
resp_h5 = requests.get('https://www.google.com')
resp_h4 = requests.get('http://www.duckduckgo.com', allow_redirects=False)   # 301 redirect

In [3]:
html5_content = resp_h5.content
html4_content = resp_h4.content

In [4]:
html4_content

b'<html>\r\n<head><title>301 Moved Permanently</title></head>\r\n<body>\r\n<center><h1>301 Moved Permanently</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

In [5]:
html5_content

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/logos/doodles/2021/day-of-the-dead-2021-6753651837109263-law.gif" itemprop="image"><meta content="Day of the Dead 2021" property="twitter:title"><meta content="Happy Day of the Dead 2021! #GoogleDoodle" property="twitter:description"><meta content="Happy Day of the Dead 2021! #GoogleDoodle" property="og:description"><meta content="summary_large_image" property="twitter:card"><meta content="@GoogleDoodles" property="twitter:site"><meta content="https://www.google.com/logos/doodles/2021/day-of-the-dead-2021-6753651837109263-2xa.gif" property="twitter:image"><meta content="https://www.google
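Note that `resp.content` is a `bytes` object, as the `b'...'` prefix in the output above shows. The short sketch below uses a made-up byte string (not the real response) to illustrate why calling `str()` on bytes without an encoding is usually not what you want before parsing: it returns the printable representation, prefix and escapes included, rather than the decoded text.

```python
# Hypothetical stand-in for a response body; real code would use resp.content.
raw = b"<p>It's caf\xc3\xa9 time</p>"

as_repr = str(raw)             # the repr of the bytes object, not its text
as_text = raw.decode("utf-8")  # the decoded markup
also_text = str(raw, "utf-8")  # equivalent to raw.decode("utf-8")

print(as_repr)
print(as_text)
```

With `requests`, `resp.text` performs this decoding for you, using the encoding it infers from the response.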

## Parsing HTML using HTMLParser

In [6]:
from html.parser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag: {} (attrs {})".format(tag, attrs))

    def handle_endtag(self, tag):
        print("Encountered an end tag: {}".format(tag))

    def handle_data(self, data):
        '''Do nothing here'''

In [7]:
parser = MyHTMLParser()
# Feed decoded text; str(html4_content) would pass the bytes repr ("b'...'") instead
parser.feed(resp_h4.text)

Encountered a start tag: html (attrs [])
Encountered a start tag: head (attrs [])
Encountered a start tag: title (attrs [])
Encountered an end tag: title
Encountered an end tag: head
Encountered a start tag: body (attrs [])
Encountered a start tag: center (attrs [])
Encountered a start tag: h1 (attrs [])
Encountered an end tag: h1
Encountered an end tag: center
Encountered a start tag: hr (attrs [])
Encountered a start tag: center (attrs [])
Encountered an end tag: center
Encountered an end tag: body
Encountered an end tag: html


In [8]:
parser.feed(resp_h5.text)

Encountered a start tag: html (attrs [('itemscope', ''), ('itemtype', 'http://schema.org/WebPage'), ('lang', 'en')])
Encountered a start tag: head (attrs [])
Encountered a start tag: meta (attrs [('content', "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."), ('name', 'description')])
Encountered a start tag: meta (attrs [('content', 'noodp'), ('name', 'robots')])
Encountered a start tag: meta (attrs [('content', 'text/html; charset=UTF-8'), ('http-equiv', 'Content-Type')])
Encountered a start tag: meta (attrs [('content', '/logos/doodles/2021/day-of-the-dead-2021-6753651837109263-law.gif'), ('itemprop', 'image')])
Encountered a start tag: meta (attrs [('content', 'Day of the Dead 2021'), ('property', 'twitter:title')])
Encountered a start tag: meta (attrs [('content', 'Happy Day of the Dead 2021! #GoogleDoodle'), ('property', 'twitter:description')])
Encountered a start 
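
Printing every tag is useful for seeing how `HTMLParser` works, but in practice a handler usually accumulates something. As a minimal sketch (the `LinkCollector` class and the sample markup are made up for illustration), the same callback approach can gather the `href` of each `<a>` tag:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Made-up sample document; one anchor deliberately has no href
sample = (
    '<html><body><a href="/one">1</a><a name="x">no href</a>'
    '<a href="https://example.com/two">2</a></body></html>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Keeping the results on the parser instance avoids global state and lets you reuse the class for several documents by creating a fresh instance each time.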

## Parsing (bad) HTML using BeautifulSoup

In [9]:
!pip install -U beautifulsoup4

Looking in links: /home/rick446/src/wheelhouse
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
Installing collected packages: beautifulsoup4
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.9.3
    Uninstalling beautifulsoup4-4.9.3:
      Successfully uninstalled beautifulsoup4-4.9.3
Successfully installed beautifulsoup4-4.10.0


In [10]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html4_content, 'html.parser')

In [11]:
soup.find('head')

<head><title>301 Moved Permanently</title></head>

In [12]:
type(_)

bs4.element.Tag

In [13]:
soup.find_all('center')

[<center><h1>301 Moved Permanently</h1></center>, <center>nginx</center>]

In [14]:
soup.select('head > title')

[<title>301 Moved Permanently</title>]

In [15]:
!pip install html5lib

Looking in links: /home/rick446/src/wheelhouse


In [16]:
soup = BeautifulSoup(html5_content, 'html5lib')

In [17]:
soup.find_all('div')

[<div id="mngb"><div id="gbar"><nobr><b class="gb1">Search</b> <a class="gb1" href="https://www.google.com/imghp?hl=en&amp;tab=wi">Images</a> <a class="gb1" href="https://maps.google.com/maps?hl=en&amp;tab=wl">Maps</a> <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=w8">Play</a> <a class="gb1" href="https://www.youtube.com/?gl=US&amp;tab=w1">YouTube</a> <a class="gb1" href="https://news.google.com/?tab=wn">News</a> <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a> <a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a> <a class="gb1" href="https://www.google.com/intl/en/about/products?tab=wh" style="text-decoration:none"><u>More</u> »</a></nobr></div><div id="guser" width="100%"><nobr><span class="gbi" id="gbn"></span><span class="gbf" id="gbf"></span><span id="gbe"></span><a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a> | <a class="gb4" href="/preferences?hl=en">Settings</a> | <a class="gb4" href="https://accounts

In [18]:
soup.find_all(attrs={'id': 'footer'})

[<span id="footer"><div style="font-size:10pt"><div id="WqQANb" style="margin:19px auto;text-align:center"><a href="/intl/en/ads/">Advertising Programs</a><a href="/services/">Business Solutions</a><a href="/intl/en/about.html">About Google</a></div></div><p style="font-size:8pt;color:#70757a">© 2021 - <a href="/intl/en/policies/privacy/">Privacy</a> - <a href="/intl/en/policies/terms/">Terms</a></p></span>]

In [19]:
soup.select('span#footer a')

[<a href="/intl/en/ads/">Advertising Programs</a>,
 <a href="/services/">Business Solutions</a>,
 <a href="/intl/en/about.html">About Google</a>,
 <a href="/intl/en/policies/privacy/">Privacy</a>,
 <a href="/intl/en/policies/terms/">Terms</a>]

In [20]:
soup.select('#footer')

[<span id="footer"><div style="font-size:10pt"><div id="WqQANb" style="margin:19px auto;text-align:center"><a href="/intl/en/ads/">Advertising Programs</a><a href="/services/">Business Solutions</a><a href="/intl/en/about.html">About Google</a></div></div><p style="font-size:8pt;color:#70757a">© 2021 - <a href="/intl/en/policies/privacy/">Privacy</a> - <a href="/intl/en/policies/terms/">Terms</a></p></span>]

In [21]:
soup.title

<title>Google</title>

In [22]:
# Skip <a> tags without an href; a None in the set would make sorted() raise TypeError
links = {link.get('href') for link in soup.find_all('a') if link.get('href')}
sorted(links)[:10]

['/advanced_search?hl=en&authuser=0',
 '/intl/en/about.html',
 '/intl/en/ads/',
 '/intl/en/policies/privacy/',
 '/intl/en/policies/terms/',
 '/preferences?hl=en',
 '/search?ie=UTF-8&q=Day+of+the+Dead&oi=ddle&ct=183975246&hl=en&si=AHuW2sT-QI6yLdyEfUBJHlK3UbYl5IMFaEVy5cpr0uPqDA9_orItmtK-RUDMXW2jBrveUiOs-PG0zrdB142sdd5j99n3k78UmE9HWvg4y2oq-n40RJasTX0%3D&sa=X&ved=0ahUKEwj8kejA0vrzAhXhl2oFHfnADlEQPQgD',
 '/services/',
 'http://www.google.com/history/optout?hl=en',
 'https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/&ec=GAZAAQ']
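
The list above mixes relative paths such as `/services/` with absolute URLs. If you plan to fetch the links, you will probably want to resolve them against the page's URL first. Here is a minimal sketch using the standard library's `urljoin`; the `base_url` value and the short `hrefs` list are assumptions chosen to mirror the output above, not values taken from the parsed page.

```python
from urllib.parse import urljoin

# Assumed: the URL of the page the links were scraped from
base_url = "https://www.google.com/"

# A few hrefs like those above, plus a None for an <a> tag with no href
hrefs = [
    "/intl/en/about.html",
    "/services/",
    "http://www.google.com/history/optout?hl=en",
    None,
]

# urljoin leaves absolute URLs untouched and resolves relative ones
absolute = sorted({urljoin(base_url, h) for h in hrefs if h})
for url in absolute:
    print(url)
```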

## Parsing HTML using lxml

`lxml` is a (very) fast HTML and XML parser.

It used to be difficult to install, but prebuilt wheels now cover most platforms, so `pip install lxml` usually works out of the box. If it does not, see the [installation instructions](https://lxml.de/installation.html).

If no wheel is available for your platform and you have a C compiler installed, you can try building it with statically linked dependencies:

```
STATIC_DEPS=true sudo pip install lxml
```

In [23]:
!pip install -U lxml

Looking in links: /home/rick446/src/wheelhouse
Collecting lxml
  Downloading lxml-4.6.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.9 MB)
Installing collected packages: lxml
  Attempting uninstall: lxml
    Found existing installation: lxml 4.6.3
    Uninstalling lxml-4.6.3:
      Successfully uninstalled lxml-4.6.3
Successfully installed lxml-4.6.4


In [24]:
# Easy: use BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html4_content, 'lxml')

In [25]:
soup.title

<title>301 Moved Permanently</title>

In [26]:
soup = BeautifulSoup(html5_content, 'lxml')

In [27]:
soup.title

<title>Google</title>

You can also use lxml directly:

In [28]:
from lxml import etree
from io import BytesIO, StringIO

In [29]:
parser = etree.HTMLParser()
# Parse the raw bytes; str(html5_content) would parse the bytes repr ("b'...'") instead
tree = etree.parse(BytesIO(html5_content), parser)

In [30]:
root = tree.getroot()

In [31]:
root.tag

'html'

In [32]:
[child.tag for child in root]

['head', 'body']

In [33]:
for link in tree.findall('.//a')[:10]:
    print(link.attrib['href'])

https://www.google.com/imghp?hl=en&tab=wi
https://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=US&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en


## Parsing XML using SAX

In [34]:
xml_content = requests.get('http://feed.thisamericanlife.org/talpodcast').content

In [35]:
print(xml_content[:1000])

b'<?xml version="1.0" encoding="UTF-8"?>\r\n<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2enclosuresfull.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feed.thisamericanlife.org/~d/styles/itemcontent.css"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0" xml:base="https://www.thisamericanlife.org">\r\n<channel>\r\n <title>This American Life</title>\r\n <link>https://www.thisamericanlife.org</link>\r\n <description>This American Life is a weekly public radio show, heard by 2.2 million people on more than 500 stations. Another 2.5 million people download the weekly podcast. It is hosted by Ira Glass, produced in collaboration with Chicago Public Media, delivered to stations by PRX The Public Radio Exchange, and has won all of the major broadcasting awards.</description>\r\n <language>en</language>\r\n <copyright>C

In [36]:
import xml.sax

In [37]:
class RSSHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_link = False   # True while inside a <link> element
        self.link = None       # text accumulated for the current <link>
    def startElement(self, name, attrs):
        if name == 'link':
            self.in_link = True
            self.link = ''
    def endElement(self, name):
        if name == 'link':
            self.in_link = False
            print(self.link.strip())
            self.link = None
    def characters(self, content):
        if self.in_link:
            self.link += content

In [38]:
parser = xml.sax.make_parser()
parser.setContentHandler(RSSHandler())
parser.parse(StringIO(str(xml_content, 'utf-8')))

https://www.thisamericanlife.org
http://feed.thisamericanlife.org/~r/talpodcast/~3/C5f3oKxcJY0/an-invitation-to-tea
http://feed.thisamericanlife.org/~r/talpodcast/~3/lcq4nXWZcXo/audience-of-one
http://feed.thisamericanlife.org/~r/talpodcast/~3/K2FV8ELmBHA/the-ferryman
http://feed.thisamericanlife.org/~r/talpodcast/~3/DrsRiVRDOe0/recordings-for-someone
http://feed.thisamericanlife.org/~r/talpodcast/~3/q8IuT3Ub28k/my-bad
http://feed.thisamericanlife.org/~r/talpodcast/~3/_KDozKOQf4s/the-end-of-the-world-as-we-know-it
http://feed.thisamericanlife.org/~r/talpodcast/~3/LPqZkwRXTAI/suitable-for-children
http://feed.thisamericanlife.org/~r/talpodcast/~3/fGbHAZF0phM/this-is-just-some-songs
http://feed.thisamericanlife.org/~r/talpodcast/~3/v4ChAcYYNOI/getting-out
http://feed.thisamericanlife.org/~r/talpodcast/~3/oY0USTfxlQg/essential
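
The handler above prints each link as soon as its element closes. If you need the values afterwards, a variation is to collect them on the handler. The sketch below is one way to do that, using a made-up feed (`sample_rss`) and a hypothetical `SaxLinkCollector` class; it also buffers the text in a list, since SAX may deliver an element's text through several `characters()` calls.

```python
import xml.sax

class SaxLinkCollector(xml.sax.ContentHandler):
    """Hypothetical handler: stores the text of every <link> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._buffer = None  # list of text chunks while inside <link>

    def startElement(self, name, attrs):
        if name == "link":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "link" and self._buffer is not None:
            self.links.append("".join(self._buffer).strip())
            self._buffer = None

# Made-up miniature feed, not the real podcast feed
sample_rss = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel>
 <title>Example feed</title>
 <link>https://example.com</link>
 <item><link> https://example.com/episode-1 </link></item>
</channel></rss>"""

handler = SaxLinkCollector()
xml.sax.parseString(sample_rss, handler)
print(handler.links)
```

`xml.sax.parseString` accepts the bytes directly, so no `StringIO` wrapper is needed here.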


## Parsing XML using minidom

In [39]:
import xml.dom.minidom

In [40]:
dom = xml.dom.minidom.parseString(xml_content)

In [41]:
dom.childNodes[2].childNodes[1].childNodes

[<DOM Text node "'\n '">,
 <DOM Element: title at 0x7f5bbc953790>,
 <DOM Text node "'\n '">,
 <DOM Element: link at 0x7f5bbc953820>,
 <DOM Text node "'\n '">,
 <DOM Element: description at 0x7f5bbc9538b0>,
 <DOM Text node "'\n '">,
 <DOM Element: language at 0x7f5bbc953940>,
 <DOM Text node "'\n '">,
 <DOM Element: copyright at 0x7f5bbc9539d0>,
 <DOM Text node "'\n '">,
 <DOM Element: itunes:author at 0x7f5bbc953a60>,
 <DOM Text node "'\n '">,
 <DOM Element: itunes:subtitle at 0x7f5bbc953af0>,
 <DOM Text node "'\n '">,
 <DOM Element: itunes:owner at 0x7f5bbc953b80>,
 <DOM Text node "'\n '">,
 <DOM Element: itunes:category at 0x7f5bbc953ca0>,
 <DOM Text node "'\n '">,
 <DOM Element: itunes:category at 0x7f5bbc953d30>,
 <DOM Text node "'\n '">,
 <DOM Element: itunes:category at 0x7f5bbc953dc0>,
 <DOM Text node "'\n '">,
 <DOM Element: itunes:image at 0x7f5bbc953f70>,
 <DOM Text node "'\n'">,
 <DOM Element: atom10:link at 0x7f5bbc8dc040>,
 <DOM Element: feedburner:info at 0x7f5bbc8dc0d0>,
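
Indexing into `childNodes` is brittle: as the output shows, whitespace between elements becomes text nodes, so positions shift whenever the formatting changes. A sketch of a sturdier approach, on a made-up feed (`sample_rss` is an assumption, not the real data), is to look elements up by name and filter on node type:

```python
import xml.dom.minidom

# Made-up miniature feed for illustration
sample_rss = """<rss version="2.0"><channel>
 <title>Example feed</title>
 <link>https://example.com</link>
 <item><title>Episode 1</title><link>https://example.com/episode-1</link></item>
</channel></rss>"""

dom = xml.dom.minidom.parseString(sample_rss)
channel = dom.getElementsByTagName("channel")[0]

# Direct element children only, skipping the whitespace text nodes
child_tags = [n.tagName for n in channel.childNodes
              if n.nodeType == n.ELEMENT_NODE]

# For a simple element like <link>, firstChild is its text node
links = [el.firstChild.data for el in dom.getElementsByTagName("link")]

print(child_tags)
print(links)
```

Note that `getElementsByTagName` searches all descendants, so it returns the channel-level `<link>` as well as the one inside `<item>`.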

## Parsing XML using ElementTree

In [42]:
from xml.etree import ElementTree

In [43]:
tree = ElementTree.parse(StringIO(str(xml_content, 'utf-8')))

In [44]:
tree.getroot()

<Element 'rss' at 0x7f5bbc8f55e0>

In [45]:
for link in tree.findall('.//link'):
    print(link.text)

https://www.thisamericanlife.org
http://feed.thisamericanlife.org/~r/talpodcast/~3/C5f3oKxcJY0/an-invitation-to-tea
http://feed.thisamericanlife.org/~r/talpodcast/~3/lcq4nXWZcXo/audience-of-one
http://feed.thisamericanlife.org/~r/talpodcast/~3/K2FV8ELmBHA/the-ferryman
http://feed.thisamericanlife.org/~r/talpodcast/~3/DrsRiVRDOe0/recordings-for-someone
http://feed.thisamericanlife.org/~r/talpodcast/~3/q8IuT3Ub28k/my-bad
http://feed.thisamericanlife.org/~r/talpodcast/~3/_KDozKOQf4s/the-end-of-the-world-as-we-know-it
http://feed.thisamericanlife.org/~r/talpodcast/~3/LPqZkwRXTAI/suitable-for-children
http://feed.thisamericanlife.org/~r/talpodcast/~3/fGbHAZF0phM/this-is-just-some-songs
http://feed.thisamericanlife.org/~r/talpodcast/~3/v4ChAcYYNOI/getting-out
http://feed.thisamericanlife.org/~r/talpodcast/~3/oY0USTfxlQg/essential
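
Note that `.//link` matches only the un-namespaced `<link>` elements; the feed's `atom10:link` elements live in the Atom namespace and need a namespace-aware query. A minimal sketch on a made-up feed (`sample_rss` and the `atom` prefix in the mapping are assumptions; only the namespace URI has to match the document):

```python
from xml.etree import ElementTree

# Made-up miniature feed using the same Atom namespace as the real one
sample_rss = """<rss version="2.0" xmlns:atom10="http://www.w3.org/2005/Atom"><channel>
 <link>https://example.com</link>
 <atom10:link href="https://example.com/feed" rel="self"/>
 <item><link>https://example.com/episode-1</link></item>
</channel></rss>"""

root = ElementTree.fromstring(sample_rss)

# The prefix used in queries is arbitrary; it maps to the namespace URI
ns = {"atom": "http://www.w3.org/2005/Atom"}

plain_links = [el.text for el in root.findall(".//link")]
atom_links = [el.get("href") for el in root.findall(".//atom:link", ns)]

print(plain_links)
print(atom_links)
```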


## Parsing XML using lxml

In [46]:
soup = BeautifulSoup(xml_content, 'lxml-xml')

In [47]:
soup.find_all('link')

[<link>https://www.thisamericanlife.org</link>,
 <atom10:link href="http://feed.thisamericanlife.org/talpodcast" rel="self" type="application/rss+xml" xmlns:atom10="http://www.w3.org/2005/Atom"/>,
 <atom10:link href="http://pubsubhubbub.appspot.com/" rel="hub" xmlns:atom10="http://www.w3.org/2005/Atom"/>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/C5f3oKxcJY0/an-invitation-to-tea</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/lcq4nXWZcXo/audience-of-one</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/K2FV8ELmBHA/the-ferryman</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/DrsRiVRDOe0/recordings-for-someone</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/q8IuT3Ub28k/my-bad</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/_KDozKOQf4s/the-end-of-the-world-as-we-know-it</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/LPqZkwRXTAI/suitable-for-children</link>,
 <link>http://fee

In [48]:
soup.select("rss link")

[<link>https://www.thisamericanlife.org</link>,
 <atom10:link href="http://feed.thisamericanlife.org/talpodcast" rel="self" type="application/rss+xml" xmlns:atom10="http://www.w3.org/2005/Atom"/>,
 <atom10:link href="http://pubsubhubbub.appspot.com/" rel="hub" xmlns:atom10="http://www.w3.org/2005/Atom"/>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/C5f3oKxcJY0/an-invitation-to-tea</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/lcq4nXWZcXo/audience-of-one</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/K2FV8ELmBHA/the-ferryman</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/DrsRiVRDOe0/recordings-for-someone</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/q8IuT3Ub28k/my-bad</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/_KDozKOQf4s/the-end-of-the-world-as-we-know-it</link>,
 <link>http://feed.thisamericanlife.org/~r/talpodcast/~3/LPqZkwRXTAI/suitable-for-children</link>,
 <link>http://fee

To use lxml directly:

In [49]:
from io import BytesIO
parser = etree.XMLParser()
tree = etree.parse(BytesIO(xml_content), parser)

In [50]:
tree.getroot()

<Element rss at 0x7f5bbc8b6040>

In [51]:
tree.findall('.//link')

[<Element link at 0x7f5bbc8b7bc0>,
 <Element link at 0x7f5bbc8b7b80>,
 <Element link at 0x7f5bbc8b7c00>,
 <Element link at 0x7f5bbc8b7c80>,
 <Element link at 0x7f5bbc8b7cc0>,
 <Element link at 0x7f5bbc8b7d00>,
 <Element link at 0x7f5bbc8b7d40>,
 <Element link at 0x7f5bbc8b7d80>,
 <Element link at 0x7f5bbc8b7dc0>,
 <Element link at 0x7f5bbc8b7e00>,
 <Element link at 0x7f5bbc8b7e40>]

Open the [sgml-lab][sgml-lab]

[sgml-lab]: ./sgml-lab.ipynb