### Use an HTML Parser to Scrape Websites

Regular expressions are great for pattern matching in general, but it is often easier to use an HTML parser that is explicitly designed for parsing HTML pages. Many Python tools are written for this purpose, and the Beautiful Soup library is a good one to start with.

To install Beautiful Soup, run the following in your terminal:

$ pip3 install beautifulsoup4

In [2]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

In [3]:
print(soup.get_text())




Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine
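
As the output shows, get_text() keeps the whitespace that sits between tags, which leaves many blank lines. One way to tidy this up is sketched below. The HTML string is a hypothetical stand-in loosely modeled on the profile page, not the page's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the profile page (an assumption).
html = """
<html>
<head><title>Profile: Dionysus</title></head>
<body>
<h2>Name: Dionysus</h2>
<p>Hometown: Mount Olympus</p>
<p>Favorite animal: Leopard</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Split the extracted text into lines and keep only the non-empty ones.
lines = [line.strip() for line in soup.get_text().splitlines() if line.strip()]
print(lines)
```

Each entry in lines is then one piece of visible text, which is easier to search or split on ": " than the raw output.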






Sometimes the data you want is identified by the HTML tags themselves. For instance, perhaps you want to retrieve the URLs for all the images on the page. These URLs are stored in the src attribute of <img> HTML tags. In this case, you can use the find_all() method to return a list of all instances of that particular tag:

In [4]:
soup.find_all("img")

[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

In [5]:
image1, image2 = soup.find_all("img")

In [6]:
image1.name

'img'

In [7]:
image1["src"]


'/static/dionysus.jpg'

In [8]:
image2["src"]

'/static/grapes.png'
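
The src values above are relative paths, so they cannot be downloaded as they stand. A minimal sketch of turning them into absolute URLs with urljoin() from the standard library follows. The inline HTML simply repeats the two <img> tags shown earlier so that the example runs without a network connection.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# The page the images were found on, used as the base for relative paths.
base_url = "http://olympus.realpython.org/profiles/dionysus"

# Inline copy of the two <img> tags from the output above.
html = '<img src="/static/dionysus.jpg"/><img src="/static/grapes.png"/>'
soup = BeautifulSoup(html, "html.parser")

# Resolve each relative src against the page URL.
image_urls = [urljoin(base_url, img["src"]) for img in soup.find_all("img")]
print(image_urls)
```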

In [9]:
soup.title

<title>Profile: Dionysus</title>

In [10]:
soup.title.string

'Profile: Dionysus'

In [11]:
soup.find_all("img", src="/static/dionysus.jpg")

[<img src="/static/dionysus.jpg"/>]

In [12]:
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

In [13]:
page.status_code

200
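
A status code of 200 means the request succeeded. In a script you would usually check this before parsing. The helper below is one possible sketch: fetch_html is a hypothetical name, and the 10-second timeout is an arbitrary choice, not something the page requires.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.content
```

Calling fetch_html() with the URL used above should return the same bytes as page.content, while a malformed URL or a failed request returns None instead of raising.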

In [14]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

In [16]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [17]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [18]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

As you can see, all of the items are BeautifulSoup objects. The first is a Doctype object, which contains information about the type of the document. The second is a NavigableString, which represents text found in the HTML document. The final item is a Tag object, which contains other nested tags. The most important object type, and the one we’ll deal with most often, is the Tag object.

The Tag object allows us to navigate through an HTML document and extract other tags and text. You can learn more about the various BeautifulSoup objects in the Beautiful Soup documentation.
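
Since Tag is usually the type of interest, it can help to filter the children by type rather than by position. The sketch below assumes an inline copy of the simple example page, with the whitespace between tags removed for brevity.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the simple example page (whitespace condensed).
html = (
    "<!DOCTYPE html>\n"
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Keep only Tag children, skipping the Doctype and NavigableString items.
tags = [child for child in soup.children if isinstance(child, Tag)]
tag_names = [tag.name for tag in tags]
print(tag_names)
```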

We can now select the html tag and its children by taking the third item in the list:

In [19]:
html = list(soup.children)[2]

In [20]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

In [23]:
body = list(html.children)[3]

In [22]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [24]:
p = list(body.children)[1]

In [25]:
p.get_text()

'Here is some simple content for this page.'
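
Indexing into children by position works, but it breaks as soon as the whitespace in the page changes. Beautiful Soup also lets you reach the first tag of a given name through attribute access, which is sketched here on an inline copy of the same page.

```python
from bs4 import BeautifulSoup

# Inline copy of the simple example page.
html = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""
soup = BeautifulSoup(html, "html.parser")

# soup.html.body.p walks html -> body -> first <p> without indexing.
text = soup.html.body.p.get_text()
print(text)
```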

In [26]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [27]:
soup.find_all('p')[0].get_text()


'Here is some simple content for this page.'

In [28]:
soup.find('p')

<p>Here is some simple content for this page.</p>
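
The two methods also differ when nothing matches: find() returns None, while find_all() returns an empty list. The sketch below shows the difference and a simple guard. The <table> lookup is only an illustration of a tag that is absent from this page.

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Here is some simple content for this page.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

missing_one = soup.find("table")      # no <table> on the page, so None
missing_all = soup.find_all("table")  # empty list, safe to loop over

# Guard against None before calling methods on the result of find().
table_text = missing_one.get_text() if missing_one is not None else ""
print(missing_one, missing_all, repr(table_text))
```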

The next examples use a page whose HTML includes class and id attributes:

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p>
<p class="inner-text">
Second paragraph.
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
First outer paragraph.
</b>
</p>
<p class="outer-text">
<b>
Second outer paragraph.
</b>
</p>
</body>
</html>

In [31]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')

In [32]:
soup.find_all('p', class_='outer-text')


[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [33]:
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [34]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

In [35]:
soup.select("div p")


[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
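
The select() method accepts CSS selectors, so class and id lookups can be written in the same syntax. A short sketch on an inline copy of the page's HTML follows. It uses select_one(), which returns only the first match, and get_text(strip=True) to drop the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Inline copy of the ids_and_classes example page.
html = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# "p.outer-text" matches <p> tags that carry the outer-text class.
outer = [p.get_text(strip=True) for p in soup.select("p.outer-text")]

# "#first" matches the element whose id is "first".
first = soup.select_one("#first").get_text(strip=True)

# Chained classes require both classes on the same tag.
both_ids = [p["id"] for p in soup.select("p.outer-text.first-item")]
print(outer, first, both_ids)
```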

In [36]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Sunny, with a high near 57. North northeast wind 5 to 7 mph becoming calm. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 57. North northeast wind 5 to 7 mph becoming calm. "/>
 </p>
 <p class="short-desc">
  Sunny
 </p>
 <p class="temp temp-high">
  High: 57 °F
 </p>
</div>
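
With one forecast item isolated, the individual fields can be pulled out by class name. The sketch below works on an inline copy of the output above rather than the live page, whose markup may have changed. The dictionary keys are names chosen for this illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of the forecast item printed above (alt text omitted).
html = """
<div class="tombstone-container">
 <p class="period-name">Today<br/><br/></p>
 <p><img class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 57. North northeast wind 5 to 7 mph becoming calm. "/></p>
 <p class="short-desc">Sunny</p>
 <p class="temp temp-high">High: 57 °F</p>
</div>
"""
item = BeautifulSoup(html, "html.parser").find(class_="tombstone-container")

forecast = {
    "period": item.find(class_="period-name").get_text(strip=True),
    "short_desc": item.find(class_="short-desc").get_text(strip=True),
    # class_="temp" matches because the tag lists "temp" among its classes.
    "temp": item.find(class_="temp").get_text(strip=True),
    # The longer description lives in the title attribute of the <img> tag.
    "description": item.find("img")["title"].strip(),
}
print(forecast)
```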
