# Introduction to BeautifulSoup

BeautifulSoup is an HTML parser used to extract information from HTML pages. It does not download pages itself: we use another tool, such as the `urllib.request` module, to connect to a site and fetch the page, then pass the HTML to BeautifulSoup.

## Parsing a simple HTML page

Suppose we have an HTML page stored as a string named `html`:

In [1]:
html = """
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

<p>My Second paragraph.</p>

</body>
</html>
"""

We can load the string into BeautifulSoup:

In [2]:
from bs4 import BeautifulSoup
bs = BeautifulSoup(html, "html5lib")
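
`html5lib` is a separate package that has to be installed alongside `bs4`. As a minimal sketch, assuming only `bs4` is available, the same constructor also accepts Python's built-in `"html.parser"`. The fragment and the names `fragment` and `doc` are illustrative. Unlike `html5lib`, the built-in parser does not wrap a bare fragment in `<html>` and `<body>` tags, so the two parsers can produce different trees for the same input.

```python
from bs4 import BeautifulSoup

fragment = "<p>My first paragraph.</p>"

# "html.parser" ships with Python, so no extra install is needed.
doc = BeautifulSoup(fragment, "html.parser")

print(doc.p.get_text())
print(doc.html)  # the built-in parser adds no <html> wrapper around a fragment
```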

Now we can access the HTML elements easily:

In [3]:
bs.html.body.h1

<h1>My First Heading</h1>

In [4]:
# Pretty print -- showing the hierarchical structure of the HTML

print(bs.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h1>
   My First Heading
  </h1>
  <p>
   My first paragraph.
  </p>
  <p>
   My Second paragraph.
  </p>
 </body>
</html>


In [5]:
bs.html.head

<head>
<title>Page Title</title>
</head>

In [6]:
bs.html.head.title

<title>Page Title</title>

In [7]:
type(bs.html.head.title)

bs4.element.Tag

In [8]:
bs.html.head.title.get_text()

'Page Title'

In [9]:
bs.find_all("p")

[<p>My first paragraph.</p>, <p>My Second paragraph.</p>]

In [10]:
# this only shows the first occurrence (if there are multiple ones)
bs.html.body.p

<p>My first paragraph.</p>
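
The dotted access used above is shorthand for `find()`, which returns the first match, or `None` when nothing matches. A minimal sketch on a cut-down copy of the page follows. The names `sample` and `doc` are illustrative, and the built-in `"html.parser"` is used so the snippet runs without `html5lib`.

```python
from bs4 import BeautifulSoup

sample = (
    "<html><body><h1>My First Heading</h1>"
    "<p>My first paragraph.</p><p>My Second paragraph.</p>"
    "</body></html>"
)
doc = BeautifulSoup(sample, "html.parser")

first_p = doc.find("p")        # same element as doc.body.p
all_p = doc.find_all("p")      # every match, in document order
missing = doc.find("table")    # no <table> in the page, so this is None

# Collect the text of every paragraph in one pass.
paragraph_texts = [p.get_text() for p in all_p]
print(paragraph_texts)
```

Checking for `None` before calling `get_text()` avoids an `AttributeError` when a page lacks the expected tag.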

## Tags and attributes

In [11]:
# example copied from the internet

html = """
<html>
<head>
<style>
.green{ 
    color:#008000;
}
.red{
    color:#FF0000;
}
#text{
    width:70%;
}
</style>
</head>
<body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text">
"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. </span> 
But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news."
<p/>
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the first to arrive at her
reception. <span class="green">Anna Pavlovna</span> had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
<span class="green">St. Petersburg</span>, used only by the elite.
<p/>
"""

In [12]:
bs = BeautifulSoup(html, "html5lib")

# specify only tag

bs.find_all('span')

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. </span>, <span class="green">Anna
 Pavlovna Scherer</span>, <span class="green">Empress Marya
 Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>]

In [13]:
# Specifying both tag and attributes

bs.find_all('span', {"class": "green"})

[<span class="green">Anna
 Pavlovna Scherer</span>, <span class="green">Empress Marya
 Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>]

In [14]:
# extract the text from the tags
for x in bs.find_all('span', {"class": "green"}):
    print(x.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg


In [15]:
# match any of several values of the class attribute
for x in bs.find_all("span", {"class" : ["green", "red"]}):
    print(x.get_text())

Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. 
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
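
Besides passing a dictionary, `find_all` and `find` accept attribute filters as keyword arguments (`class_` has a trailing underscore because `class` is a Python keyword), and a tag's attributes can be read like dictionary entries. The sketch below uses a shortened, illustrative copy of the passage and the built-in `"html.parser"`.

```python
from bs4 import BeautifulSoup

snippet = """
<div id="text">
<span class="red">Well, Prince</span>
<span class="green">Anna Pavlovna Scherer</span>
<span class="green">St. Petersburg</span>
</div>
"""
doc = BeautifulSoup(snippet, "html.parser")

# Keyword form of the {"class": "green"} filter.
green_names = [s.get_text() for s in doc.find_all("span", class_="green")]

# Look an element up by its id, then read attributes like a dictionary.
div = doc.find("div", id="text")
div_id = div["id"]

# "class" can hold several values, so it is returned as a list.
first_span_classes = doc.span["class"]

# .get() returns None for a missing attribute instead of raising KeyError.
missing_attr = doc.span.get("title")

print(green_names, div_id, first_span_classes, missing_attr)
```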


## Connect to a web page

We use `urllib.request` to open a connection to a web page.

In [16]:
import urllib.request

url = 'http://www.ucla.edu'
page = urllib.request.urlopen(url)
html = page.read()
soup = BeautifulSoup(html, "html5lib")

This works. But before proceeding, let's inspect the objects we created a bit more carefully.

In [17]:
type(soup)

bs4.BeautifulSoup

The data read from the response object is of type `bytes`:

In [18]:
type(html)

bytes

We can look up the character set reported by the server and then decode the contents into a string as follows:

In [19]:
charset = page.info().get_content_charset()
charset

'utf-8'

In [20]:
page = urllib.request.urlopen(url)
html = page.read().decode(charset)
type(html)

str
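
`get_content_charset()` returns `None` when the response headers do not name a character set, and calling `decode(None)` would then raise a `TypeError`. One possible guard is sketched below with a hypothetical helper, `decode_body`, that falls back to UTF-8. The fallback choice is an assumption, not something a given site guarantees.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str], default: str = "utf-8") -> str:
    """Decode downloaded bytes, using `default` when no charset was reported."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return raw.decode(charset or default, errors="replace")


# No charset reported: fall back to UTF-8.
print(decode_body("café".encode("utf-8"), None))

# Charset reported by the server: use it.
print(decode_body("café".encode("latin-1"), "latin-1"))
```

With the response object from the cells above, the call would look like `decode_body(page.read(), page.info().get_content_charset())`.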

In [21]:
html



In [22]:
soup = BeautifulSoup(html, "html5lib")

In [23]:
# find the first occurrence of "li"

soup.find('li')

<li><a href="/students/prospective-students">Prospective Students</a></li>

In [24]:
# check the type

type(soup.find('li'))

bs4.element.Tag

In [25]:
# find all "li" tags as a list

x = soup.find_all('li')
x

[<li><a href="/students/prospective-students">Prospective Students</a></li>,
 <li><a href="/students/current-students">Current Students</a></li>,
 <li><a href="/faculty">Faculty</a></li>,
 <li><a href="/staff">Staff</a></li>,
 <li><a href="/alumni">Alumni</a></li>,
 <li><a href="/parents-and-families">Parents &amp; Families</a></li>,
 <li><a class="dropdown-link" href="/about">ABOUT</a><ul aria-live="polite" class="dropdown-wrapper" role="region">
 								<li><ul class="nav-column"><li><a href="/about/">Overview</a></li><li><a href="/about/chancellor">Chancellor</a></li><li><a href="/about/leadership">Leadership</a></li></ul><ul class="nav-column"><li><a href="/about/mission-and-values">Mission &amp; Values</a></li><li><a href="/about/facts-and-figures">Facts &amp; Figures</a></li><li><a href="/about/awards-and-honors">Awards &amp; Honors</a></li></ul><ul class="nav-column"><li><a href="/about/history">History</a></li><li><a href="/about/impact-and-accomplishments">Impact &amp; Accomp
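
Each `li` in the list above wraps an `a` tag whose `href` is relative to the site root. A minimal sketch of turning such a list into (label, absolute URL) pairs follows. It runs on a small hand-written menu rather than the live page, and the names `menu`, `base_url` and `links` are illustrative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://www.ucla.edu"
menu = """
<ul>
<li><a href="/faculty">Faculty</a></li>
<li><a href="/staff">Staff</a></li>
<li>No link here</li>
</ul>
"""
doc = BeautifulSoup(menu, "html.parser")

links = []
for li in doc.find_all("li"):
    a = li.find("a")
    # Skip list items that have no link or no href attribute.
    if a is None or not a.get("href"):
        continue
    # urljoin resolves the relative href against the page URL.
    links.append((a.get_text(strip=True), urljoin(base_url, a["href"])))

print(links)
```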

## Some web sites require a nicer connection

In [26]:
# the URL of a book

url = 'https://www.amazon.com/dp/149190142X/ref=cm_sw_r_cp_ep_dp_N-YZzb1A4XTTR'

In [27]:
from urllib.error import URLError
try:
    page = urllib.request.urlopen(url)
except URLError as e:
    print(e)
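
`HTTPError` is a subclass of `URLError`, so the `except` clause above catches both a refusal from the server and a failure to connect at all. The sketch below shows one way to tell the two apart. The helper name `describe_failure` and its message wording are hypothetical.

```python
from urllib.error import HTTPError, URLError


def describe_failure(exc: URLError) -> str:
    """Summarize why a urlopen() call failed."""
    # HTTPError must be checked first because it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        return f"server answered with HTTP {exc.code}"
    return f"could not reach the server: {exc.reason}"


print(describe_failure(URLError("timed out")))
```

In the cell above, this would replace `print(e)` with `print(describe_failure(e))`.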

In [28]:
# We need to "pretend" that we are coming from a browser (user-agent)

agent = 'Mozilla/5.0 (Windows; U; WinNT4.0; en-US; rv:1.7.9) Gecko/20050711 Firefox/1.0.5'
req = urllib.request.Request(url, headers={'User-Agent': agent})

In [29]:
page = urllib.request.urlopen(req)
html = page.read()
soup = BeautifulSoup(html, "html5lib")

In [30]:
soup.title

<title>Data Science from Scratch: First Principles with Python: Joel Grus: 9781491901427: Amazon.com: Books</title>

In [31]:
soup.find_all('span', {'class': 'a-size-base mediaTab_subtitle'})[1].get_text().strip()

'$20.50'
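
Indexing the result of `find_all` with `[1]` raises an `IndexError` as soon as the page layout changes and fewer elements match. A more defensive variant is sketched below with a hypothetical helper, `nth_text`. The sample markup is hand-written for illustration, and its first price is an invented placeholder.

```python
from typing import Optional

from bs4 import BeautifulSoup


def nth_text(doc, name: str, attrs: dict, index: int) -> Optional[str]:
    """Return the stripped text of the index-th match (index >= 0), or None."""
    matches = doc.find_all(name, attrs)
    if index >= len(matches):
        return None
    return matches[index].get_text().strip()


# Hand-written stand-in for the downloaded page.
sample = """
<span class="a-size-base mediaTab_subtitle"> $9.99 </span>
<span class="a-size-base mediaTab_subtitle"> $20.50 </span>
"""
doc = BeautifulSoup(sample, "html.parser")
price_attrs = {"class": "a-size-base mediaTab_subtitle"}

print(nth_text(doc, "span", price_attrs, 1))
print(nth_text(doc, "span", price_attrs, 5))  # fewer than six matches: None
```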

If the web site has stronger protection mechanisms, you may have to rotate user agents, sending a randomly chosen one with each request.

In [32]:
import random
def get_user_agent():
    """Return a random user agent from the list."""
    db = [ 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3) Gecko/20090305 Firefox/3.1b3 GTB5',
            'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; ko; rv:1.9.1b2) Gecko/20081201 Firefox/3.1b2',
            'Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5',
            'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.12) Gecko/20080214 Firefox/2.0.0.12',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; cs; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
            'Mozilla/5.0 (X11; U; OpenBSD i386; en-US; rv:1.8.0.5) Gecko/20060819 Firefox/1.5.0.5',
            'Mozilla/5.0 (Windows; U; Windows NT 5.0; es-ES; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3',
            'Mozilla/5.0 (Windows; U; WinNT4.0; en-US; rv:1.7.9) Gecko/20050711 Firefox/1.0.5'
        ]
    return random.choice(db)
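
A randomly chosen agent can then be attached to each request. The sketch below is self-contained: it carries a shortened copy of the list under the illustrative name `AGENTS`, and `build_request` is a hypothetical helper. It only builds the request and does not send it.

```python
import random
import urllib.request

# Shortened copy of the user-agent list, for illustration only.
AGENTS = [
    'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.12) Gecko/20080214 Firefox/2.0.0.12',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; cs; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8',
]


def build_request(url: str) -> urllib.request.Request:
    """Create a Request that carries a randomly chosen User-Agent header."""
    agent = random.choice(AGENTS)
    return urllib.request.Request(url, headers={'User-Agent': agent})


req = build_request('http://www.ucla.edu')

# urllib stores header names capitalized, hence "User-agent" when reading back.
print(req.get_header('User-agent'))
```

Passing `req` to `urllib.request.urlopen` then works exactly as in the single-agent example.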