# Parsing HTML and XML

In [1]:
import requests

In [2]:
resp_h5 = requests.get('https://www.google.com')
resp_h4 = requests.get('http://www.duckduckgo.com', allow_redirects=False)   # 301 redirect

In [3]:
html5_content = resp_h5.content
html4_content = resp_h4.content

In [4]:
html4_content

b'<html>\r\n<head><title>301 Moved Permanently</title></head>\r\n<body bgcolor="white">\r\n<center><h1>301 Moved Permanently</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

## Parsing HTML using HTMLParser

In [5]:
from html.parser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag: {} (attrs {})".format(tag, attrs))

    def handle_endtag(self, tag):
        print("Encountered a end tag: {}".format(tag))

    def handle_data(self, data):
        '''Do nothing here'''

In [6]:
parser = MyHTMLParser()
parser.feed(str(html4_content))

Encountered a start tag: html (attrs [])
Encountered a start tag: head (attrs [])
Encountered a start tag: title (attrs [])
Encountered a end tag: title
Encountered a end tag: head
Encountered a start tag: body (attrs [('bgcolor', 'white')])
Encountered a start tag: center (attrs [])
Encountered a start tag: h1 (attrs [])
Encountered a end tag: h1
Encountered a end tag: center
Encountered a start tag: hr (attrs [])
Encountered a start tag: center (attrs [])
Encountered a end tag: center
Encountered a end tag: body
Encountered a end tag: html


In [7]:
parser.feed(str(html5_content))

Encountered a start tag: html (attrs [('itemscope', ''), ('itemtype', 'http://schema.org/WebPage'), ('lang', 'en')])
Encountered a start tag: head (attrs [])
Encountered a start tag: meta (attrs [('content', "Search the world\\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\\'re looking for."), ('name', 'description')])
Encountered a start tag: meta (attrs [('content', 'noodp'), ('name', 'robots')])
Encountered a start tag: meta (attrs [('content', 'text/html; charset=UTF-8'), ('http-equiv', 'Content-Type')])
Encountered a start tag: meta (attrs [('content', '/logos/doodles/2019/celebrating-johann-sebastian-bach-5702425880035328.3-l.png'), ('itemprop', 'image')])
Encountered a start tag: meta (attrs [('content', 'Celebrating Johann Sebastian Bach'), ('property', 'twitter:title')])
Encountered a start tag: meta (attrs [('content', 'Compose your own Bach-inspired tunes with the first ever AI-powered #GoogleDo

## Parsing (bad) HTML using BeautifulSoup

In [8]:
!pip install -U beautifulsoup4

Requirement already up-to-date: beautifulsoup4 in /Users/rick446/.virtualenvs/advanced-python/lib/python3.7/site-packages (4.7.1)
[33mYou are using pip version 18.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [10]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html4_content, 'html.parser')

In [11]:
soup.find('head')

<head><title>301 Moved Permanently</title></head>

In [12]:
soup.findAll('center')

[<center><h1>301 Moved Permanently</h1></center>, <center>nginx</center>]

In [13]:
!pip install html5lib

[33mYou are using pip version 18.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [14]:
soup = BeautifulSoup(html5_content, 'html5lib')

In [15]:
soup.findAll('div')

[<div id="mngb"> <div id="gbar"><nobr><b class="gb1">Search</b> <a class="gb1" href="https://www.google.com/imghp?hl=en&amp;tab=wi">Images</a> <a class="gb1" href="https://maps.google.com/maps?hl=en&amp;tab=wl">Maps</a> <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=w8">Play</a> <a class="gb1" href="https://www.youtube.com/?gl=US&amp;tab=w1">YouTube</a> <a class="gb1" href="https://news.google.com/nwshp?hl=en&amp;tab=wn">News</a> <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a> <a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a> <a class="gb1" href="https://www.google.com/intl/en/about/products?tab=wh" style="text-decoration:none"><u>More</u> »</a></nobr></div><div id="guser" width="100%"><nobr><span class="gbi" id="gbn"></span><span class="gbf" id="gbf"></span><span id="gbe"></span><a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a> | <a class="gb4" href="/preferences?hl=en">Settings</a> | <a class="gb4" href="

In [16]:
soup.find_all(attrs={'id': 'footer'})

[<span id="footer"><div style="font-size:10pt"><div id="fll" style="margin:19px auto;text-align:center"><a href="/intl/en/ads/">Advertising�Programs</a><a href="/services/">Business Solutions</a><a href="/intl/en/about.html">About Google</a></div></div><p style="color:#767676;font-size:8pt">© 2019 - <a href="/intl/en/policies/privacy/">Privacy</a> - <a href="/intl/en/policies/terms/">Terms</a></p></span>]

In [17]:
soup.title

<title>Google</title>

In [18]:
links = (link.get('href') for link in soup.find_all('a'))
links = set(link for link in links if link)
sorted(list(links))[:10]

['/advanced_search?hl=en&authuser=0',
 '/intl/en/about.html',
 '/intl/en/ads/',
 '/intl/en/policies/privacy/',
 '/intl/en/policies/terms/',
 '/language_tools?hl=en&authuser=0',
 '/preferences?hl=en',
 '/search?site=&ie=UTF-8&q=Johann+Sebastian+Bach&oi=ddle&ct=celebrating-johann-sebastian-bach-5702425880035328&hl=en&kgmid=/m/03_f0&sa=X&ved=0ahUKEwjTrKDQoJThAhWaHTQIHWKVCrEQPQgD',
 '/services/',
 'http://www.google.com/history/optout?hl=en']

## Parsing HTML using lxml

`lxml` is a (very) fast HTML and XML parser. Unfortunately, it is sometimes difficult to install. Here are the [installation instructions](http://lxml.de/installation.html) for the intrepid.

If you're feeling lucky and have a C compiler installed, you can try:

```
STATIC_DEPS=true sudo pip install lxml
```

In [19]:
!pip install lxml

[33mYou are using pip version 18.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [20]:
# Easy: use BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html4_content, 'lxml')

In [21]:
soup.title

<title>301 Moved Permanently</title>

In [22]:
soup = BeautifulSoup(html5_content, 'lxml')

In [23]:
soup.title

<title>Google</title>

You can also use lxml directly:

In [24]:
from lxml import etree
from io import StringIO

In [25]:
parser = etree.HTMLParser()
tree = etree.parse(StringIO(str(html5_content)), parser)

In [26]:
root = tree.getroot()

In [27]:
root.tag

'html'

In [28]:
list(root)

[<Element body at 0x109b9ce48>]

In [29]:
for link in tree.findall('.//a')[:10]:
    print(link.attrib['href'])

https://www.google.com/imghp?hl=en&tab=wi
https://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=US&tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en


## Parsing XML using SAX

In [30]:
xml_content = requests.get('http://www.developintelligence.com/blog/feed/').content

In [31]:
import xml.sax

In [32]:
class RSSHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.in_link = False
        self.link = None
        xml.sax.ContentHandler.__init__(self)
    def startElement(self, name, attrs):
        if name == 'link':
            self.in_link = True
            self.link = ''
    def endElement(self, name):
        if name == 'link':
            self.in_link = False
            print(self.link.strip())
            self.link = None
    def characters(self, content):
        if self.in_link:
            self.link += content

In [33]:
parser = xml.sax.make_parser()
parser.setContentHandler(RSSHandler())
parser.parse(StringIO(str(xml_content, 'utf-8')))

http://www.developintelligence.com/blog
http://www.developintelligence.com/blog/2018/07/insert-remove-splice-and-replace-elements-with-array-splice/
http://www.developintelligence.com/blog/2018/02/reflections-reluctant-servant-leader/
http://www.developintelligence.com/blog/2017/12/10-myths-professional-training/
http://www.developintelligence.com/blog/2017/12/feedback-teams-learn-fast/
http://www.developintelligence.com/blog/2017/11/reasons-shouldnt-ignore-automation/
http://www.developintelligence.com/blog/2017/11/why-test-coverage-shouldnt-trust/
http://www.developintelligence.com/blog/2017/11/2-surprising-skills-need-learning-team/
http://www.developintelligence.com/blog/2017/11/devops-department-need-attention/
http://www.developintelligence.com/blog/2017/10/look-learning-2018/
http://www.developintelligence.com/blog/2017/10/job-specification-improve-hinder-performance/


## Parsing XML using minidom

In [34]:
import xml.dom.minidom

In [35]:
dom = xml.dom.minidom.parseString(xml_content)

In [36]:
dom.childNodes

[<DOM Element: rss at 0x109e3d470>]

In [37]:
dom.childNodes[0].childNodes[1].childNodes[:10]

[<DOM Text node "'\n\t'">,
 <DOM Element: title at 0x109e3d638>,
 <DOM Text node "'\n\t'">,
 <DOM Element: atom:link at 0x109e3d6d0>,
 <DOM Text node "'\n\t'">,
 <DOM Element: link at 0x109e3d768>,
 <DOM Text node "'\n\t'">,
 <DOM Element: description at 0x109e3d800>,
 <DOM Text node "'\n\t'">,
 <DOM Element: lastBuildDate at 0x109e3d898>]

## Parsing XML using ElementTree

In [38]:
from xml.etree import ElementTree

In [39]:
tree = ElementTree.parse(StringIO(str(xml_content, 'utf-8')))

In [40]:
tree.getroot()

<Element 'rss' at 0x109eebdb8>

In [41]:
for link in tree.findall('.//link'):
    print(link.text)

http://www.developintelligence.com/blog
http://www.developintelligence.com/blog/2018/07/insert-remove-splice-and-replace-elements-with-array-splice/
http://www.developintelligence.com/blog/2018/02/reflections-reluctant-servant-leader/
http://www.developintelligence.com/blog/2017/12/10-myths-professional-training/
http://www.developintelligence.com/blog/2017/12/feedback-teams-learn-fast/
http://www.developintelligence.com/blog/2017/11/reasons-shouldnt-ignore-automation/
http://www.developintelligence.com/blog/2017/11/why-test-coverage-shouldnt-trust/
http://www.developintelligence.com/blog/2017/11/2-surprising-skills-need-learning-team/
http://www.developintelligence.com/blog/2017/11/devops-department-need-attention/
http://www.developintelligence.com/blog/2017/10/look-learning-2018/
http://www.developintelligence.com/blog/2017/10/job-specification-improve-hinder-performance/


## Parsing XML using lxml

In [42]:
soup = BeautifulSoup(xml_content, 'lxml-xml')

In [43]:
soup.find_all('link')

[<atom:link href="http://www.developintelligence.com/blog/feed/" rel="self" type="application/rss+xml"/>,
 <link>http://www.developintelligence.com/blog</link>,
 <link>http://www.developintelligence.com/blog/2018/07/insert-remove-splice-and-replace-elements-with-array-splice/</link>,
 <link>http://www.developintelligence.com/blog/2018/02/reflections-reluctant-servant-leader/</link>,
 <link>http://www.developintelligence.com/blog/2017/12/10-myths-professional-training/</link>,
 <link>http://www.developintelligence.com/blog/2017/12/feedback-teams-learn-fast/</link>,
 <link>http://www.developintelligence.com/blog/2017/11/reasons-shouldnt-ignore-automation/</link>,
 <link>http://www.developintelligence.com/blog/2017/11/why-test-coverage-shouldnt-trust/</link>,
 <link>http://www.developintelligence.com/blog/2017/11/2-surprising-skills-need-learning-team/</link>,
 <link>http://www.developintelligence.com/blog/2017/11/devops-department-need-attention/</link>,
 <link>http://www.developintellig

In [44]:
from io import BytesIO
parser = etree.XMLParser()
tree = etree.parse(BytesIO(xml_content), parser)

In [45]:
tree.getroot()

<Element rss at 0x109fc7148>

In [46]:
tree.findall('.//link')

[<Element link at 0x109fb1c48>,
 <Element link at 0x109fc7a08>,
 <Element link at 0x109fc7ac8>,
 <Element link at 0x109fc7a48>,
 <Element link at 0x109fc7b08>,
 <Element link at 0x109fc7b48>,
 <Element link at 0x109fc7bc8>,
 <Element link at 0x109fc7c08>,
 <Element link at 0x109fc7c48>,
 <Element link at 0x109fc7c88>,
 <Element link at 0x109fc7cc8>]