## What You'll Accomplish in this Notebook

Here's what you'll do in this notebook:
<ul>
    <li>learn about the structure of an html page</li>
    <li>be introduced to the BeautifulSoup package</li>
    <li>see how to parse html code with a toy example</li>
    <li>scrape data on Seventh Son beers from some saved html code</li>
    <li>download then scrape an actual webpage</li>
</ul>

In [1]:
# Our common data handling package
import pandas as pd

# HTML Scraping With `BeautifulSoup`

In this notebook we'll learn how to scrape HTML files with <a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup</a>. Try running the code below.

In [2]:
from bs4 import BeautifulSoup

If that did not work, take a moment to install the package (for example, with `pip install beautifulsoup4`) and then rerun the import.

### Understanding the Structure of an HTML Page

`BeautifulSoup` takes in an HTML document and 'parses' it so that you can extract the information you want. To understand what that means, let's look at a toy example of a webpage. To see what the snippet of HTML code below looks like in a web browser, open <a href="SampleHTML.html">SampleHTML.html</a>.

In [3]:
# This is an html chunk
# It has a head and a body, just like you
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

We can now use `BeautifulSoup` to parse this simple html chunk.

In [4]:
# Now we make a BeautifulSoup object out of the html code
# The first input is the html code
# The second input is how you want BeautifulSoup
# to parse the code
soup = BeautifulSoup(html_doc,'html.parser')

In [5]:
# Let's use the prettify method to make our html pretty and see what it has to say
# Ideally this is how someone writing pure html code would write their code
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


HTML files have a natural tree structure, which we'll briefly cover now. Here's the tree of our sample HTML:

<img src = "html_tree.png" width = "500"></img>

Each level in the tree represents a 'generation' of the HTML code. For instance, the body has three p children, and the leftmost p has one b child. `BeautifulSoup` helps us traverse these trees to gather the data we want.
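
The parent and child relationships in the tree can be checked directly in code. The block below is a minimal sketch that uses a trimmed copy of the sample HTML; the names `tree_doc` and `tree_soup` are made up for this illustration. Passing `recursive=False` to `find_all` limits the search to direct children.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample html, kept short for illustration
tree_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters</p>
<p class="story">...</p>
</body></html>"""

tree_soup = BeautifulSoup(tree_doc, "html.parser")

# recursive=False only looks at the direct children of body
body_paragraphs = tree_soup.body.find_all("p", recursive=False)
print(len(body_paragraphs))

# The tag children of the leftmost p (True matches any tag name)
first_p_children = [child.name for child in body_paragraphs[0].find_all(True, recursive=False)]
print(first_p_children)

# Going back up the tree: the parent of the b tag is a p
print(tree_soup.b.parent.name)
```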

We'll now traverse this html sapling.

In [6]:
# Below are some examples of beautifulsoup methods and 
# attributes that help us better understand the structure 
# of html code

# What is the title of the page?
print(soup.title)
print()

# Notice we can also get the title like so
print(soup.head.title)
print()

# What if I just want the text from the title?
print(soup.title.text)
print()

# What html structure is the title's parent?
print(soup.title.parent.name)
print(soup.title.parent)
print()

# What is the first a of the html document?
print(soup.a)

# What is the first a's class?
print(soup.a['class'])
print()

# There are multiple a's. Can I find all of them?
print(soup.find_all('a'))
for a in soup.find_all('a'):
    print()
    print(a['class'], a.text)
    

<title>The Dormouse's story</title>

<title>The Dormouse's story</title>

The Dormouse's story

head
<head><title>The Dormouse's story</title></head>

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
['sister']

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

['sister'] Elsie

['sister'] Lacie

['sister'] Tillie
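
Two more lookups are often useful alongside the ones above: finding a tag by its `id`, and reading an attribute that may be missing. This is a small sketch on a made-up snippet (the second link deliberately has no `href`); `link_soup` is a name chosen for this example only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second link has no href on purpose
snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
</p>
"""
link_soup = BeautifulSoup(snippet, "html.parser")

# find returns only the first match; here we look a tag up by its id
lacie = link_soup.find("a", id="link2")
print(lacie.text)

# tag["href"] raises a KeyError when the attribute is missing,
# while tag.get("href") returns None instead
hrefs = [a.get("href") for a in link_soup.find_all("a")]
print(hrefs)
```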


#### Practice

In [7]:
# Now you practice!

# Find the first p of the document
# What is the first p's class? What string is in that p?


print(soup.p)

print(soup.p['class'])


<p class="title"><b>The Dormouse's story</b></p>
['title']


In [9]:
# For all of the a's in the document find their href

for a in soup.find_all('a'):
    print(a['href'])
    print()



http://example.com/elsie

http://example.com/lacie

http://example.com/tillie



Now that we've got some experience, let's move on to some slightly more advanced parsing.

### Now We're of Drinking Age

I've included in this repository some HTML code from an <a href = "https://untappd.com/home">Untappd</a> search. To get it, I went to Untappd, found the <a href = "https://www.seventhsonbrewing.com">Seventh Son</a> page, clicked on their beer list, and saved only the HTML code from the results. You can see the HTML file here: <a href="SeventhSon.html">SeventhSon.html</a>. We can read in that file with the following code.

In [10]:
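# Read the html file's code into a string so we can parse it
# (the with block closes the file for us once it has been read)
with open("SeventhSon.html", 'r') as html_file:
    seventh_son_beer_search = html_file.read()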

In [11]:
# You write code here to make a soup object of our code
# Sample Answer
soup = BeautifulSoup(seventh_son_beer_search,"html.parser")






In [12]:
# Look at the code using prettify
# Sample Answer
print(soup.prettify())





<div class="beer-container">
 <div class="beer-item" data-bid="382779">
  <a class="label" href="/b/seventh-son-brewing-company-humulus-nimbus/382779">
   <img src="https://untappd.akamaized.net/site/beer_logos/beer-382779_814af_sm.jpeg"/>
  </a>
  <div class="beer-details">
   <p class="name">
    <a href="/b/seventh-son-brewing-company-humulus-nimbus/382779">
     Humulus Nimbus
    </a>
   </p>
   <p class="style">
    Pale Ale - American
   </p>
   <p class="desc desc-half-382779">
    A pale golden ale that is both super crisp and super hop forward with a refreshing mouthfeel and a summer friendly 6% abv. Mosaic &amp; simcoe hops lend tart…
    <a class="read-more-beerlist track-click" data-bid="382779" data-href=":readmorebeer" data-track="brewerylist" href="#">
     Read More
    </a>
   </p>
   <p class="desc desc-full-382779" style="display: none;">
    A pale golden ale that is both super crisp and super hop forward with a refreshing mouthfeel and a summer friendly 6% abv. Mo

As we can see from the `prettify()` output this html code is more complicated than our toy example from above, but `BeautifulSoup` is able to handle it all the same. Let's write some code to go through the html and grab the beer names and then store those names in a list.

To do that let's learn a little more about BeautifulSoup's functionality. Looking at the prettify output we see that each beer is contained in a "beer-item". We can use that class information to our advantage.
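
As a quick illustration of searching by class, here is a sketch on a small made-up snippet that mimics the structure of the saved search results (the beer names and `data-bid` values are invented). It also shows the `class_` keyword, which is an equivalent way to write the same search as the attribute dictionary.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the structure of the saved search results
sample_html = """
<div class="beer-container">
 <div class="beer-item" data-bid="1"><p class="name">First Beer</p></div>
 <div class="beer-item" data-bid="2"><p class="name">Second Beer</p></div>
 <div class="other-item"><p class="name">Not a beer</p></div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Search with an attribute dictionary ...
by_dict = sample_soup.find_all("div", {"class": "beer-item"})

# ... or with the class_ keyword (class is a reserved word in Python)
by_keyword = sample_soup.find_all("div", class_="beer-item")

# Other attributes are read like dictionary keys
bids = [item["data-bid"] for item in by_keyword]
print(len(by_dict), len(by_keyword), bids)
```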

In [13]:
cooler = soup.find_all('div',{'class':"beer-item"})

In [14]:
beer0 = cooler[0]

In [15]:
beer0

<div class="beer-item" data-bid="382779">
<a class="label" href="/b/seventh-son-brewing-company-humulus-nimbus/382779">
<img src="https://untappd.akamaized.net/site/beer_logos/beer-382779_814af_sm.jpeg"/>
</a><div class="beer-details">
<p class="name"><a href="/b/seventh-son-brewing-company-humulus-nimbus/382779">Humulus Nimbus </a></p>
<p class="style">Pale Ale - American</p>
<p class="desc desc-half-382779">A pale golden ale that is both super crisp and super hop forward with a refreshing mouthfeel and a summer friendly 6% abv. Mosaic &amp; simcoe hops lend tart… <a class="read-more-beerlist track-click" data-bid="382779" data-href=":readmorebeer" data-track="brewerylist" href="#">Read More</a> </p>
<p class="desc desc-full-382779" style="display: none;">A pale golden ale that is both super crisp and super hop forward with a refreshing mouthfeel and a summer friendly 6% abv. Mosaic &amp; simcoe hops lend tart blueberry and fragrant pine to a pleasingly bitter dandelion finish. We wan

In [16]:
# Now use what we just learned to extract the name from beer 0
# The name is contained in a p element with class "name"
print(beer0.find("p",{'class':"name"}))


<p class="name"><a href="/b/seventh-son-brewing-company-humulus-nimbus/382779">Humulus Nimbus </a></p>


In [17]:
# Now make a list of all the beer names
beers = [beer.find("p",{'class':"name"}).text for beer in cooler]






In [18]:
beers

['Humulus Nimbus ',
 'Proliferous',
 'The Scientist',
 'Seventh Son American Strong Ale',
 'Stone Fort Oat Brown',
 'Oubliette',
 'Syzygy',
 'Assistant Manager',
 'Goo Goo Muck',
 'Golden Ratio',
 'Gleen',
 'Lost Sparrow',
 'Fox In the Stout',
 'Wilderman',
 'Mr. Owl',
 'Ladies And Gentlemen',
 'Brother Jon',
 'Chester Copperpot',
 'Willowolf',
 'Rime',
 'Black Sheep',
 "Hadron's Collision Table Beer",
 'Laniakea',
 'Sun Mouth',
 'Abaddon',
 'The Wild Hunt',
 'Qahwah Arab Imperial Coffee Stout',
 'Lemongrass Wit',
 'Baphomet',
 'Tessera',
 'Prime Swarm',
 'Bibendum',
 'Toast',
 'Cloudbusting',
 '4x4 Smash Ale',
 'Plowshare',
 'Big Black Cow',
 'La Mort Saison',
 'The Odd Son',
 'Nonsense',
 'Ragana Yaga',
 'Tinkerton',
 'Scam Likely',
 'Sprig',
 'Jack In the Green',
 'Joe the Lion',
 'Ootheca',
 'The Cruelest Month',
 'Smooth Pursuit',
 'Goblin King',
 'Abbey Normaal',
 'Haymake',
 'Peach Blossom',
 'Caribbean Tonic',
 "Irene's Revenge",
 'Grave Blanket Dark Rye Ale',
 'International H
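
Notice that the first name in the list above carries a trailing space. One way to tidy this up, sketched here on a made-up two-beer snippet rather than the saved file, is to use `get_text(strip=True)` in place of `.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first name has a trailing space, like the real data
sample_html = """
<div class="beer-item"><p class="name"><a href="#">Humulus Nimbus </a></p></div>
<div class="beer-item"><p class="name"><a href="#">Proliferous</a></p></div>
"""
sample_cooler = BeautifulSoup(sample_html, "html.parser").find_all("div", class_="beer-item")

# .text keeps the whitespace exactly as it appears in the html
raw_names = [beer.find("p", class_="name").text for beer in sample_cooler]

# get_text(strip=True) trims leading and trailing whitespace
clean_names = [beer.find("p", class_="name").get_text(strip=True) for beer in sample_cooler]

print(raw_names)
print(clean_names)
```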

#### Practice

You've been hired by a competitor of Seventh Son. They want a dataframe of all of Seventh Son's beers that includes each beer's name, type, abv, and ibu. Use BeautifulSoup to give them this.

In [19]:
# Your work here
beers_name = [beer.find("p",{'class':"name"}).text for beer in cooler]
beers_type = [beer.find("p",{'class':"style"}).text for beer in cooler]
beers_abv = [beer.find("p",{'class':"abv"}).text for beer in cooler]
beers_ibu = [beer.find("p",{'class':"ibu"}).text for beer in cooler]

beer_df = pd.DataFrame({'name':beers_name,'type':beers_type, 'abv':beers_abv, 'ibu':beers_ibu})

beer_df.head()

| | name | type | abv | ibu |
|---|---|---|---|---|
| 0 | Humulus Nimbus | Pale Ale - American | \n6% ABV | \n53 IBU |
| 1 | Proliferous | IPA - Imperial / Double | \n8.3% ABV | \n85 IBU |
| 2 | The Scientist | IPA - American | \n7% ABV | \n75 IBU |
| 3 | Seventh Son American Strong Ale | Strong Ale - American | \n7.7% ABV | \n40 IBU |
| 4 | Stone Fort Oat Brown | Brown Ale - English | \n5.25% ABV | \n21 IBU |
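
The list comprehensions above assume every beer has all four `p` elements. If one were missing, `find` would return `None` and calling `.text` on it would raise an `AttributeError`. The sketch below shows one defensive pattern on a made-up snippet; `text_or_none` is a hypothetical helper written for this example, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, tag_name, class_name):
    """Return the stripped text of the first matching tag, or None if there is no match."""
    found = parent.find(tag_name, class_=class_name)
    return found.get_text(strip=True) if found is not None else None

# Hypothetical snippet: the second beer has no abv element
sample_html = """
<div class="beer-item"><p class="name">Beer A</p><p class="abv">
6% ABV</p></div>
<div class="beer-item"><p class="name">Beer B</p></div>
"""
items = BeautifulSoup(sample_html, "html.parser").find_all("div", class_="beer-item")

rows = [
    {"name": text_or_none(beer, "p", "name"), "abv": text_or_none(beer, "p", "abv")}
    for beer in items
]
print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame`.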


In [20]:
# Let's do some processing to remove the extra characters from abv and ibu.
# Note: lstrip and rstrip remove any of the listed characters from the ends
# (not an exact prefix or suffix), which works for values like "\n6% ABV".
def removeABV(string):
    string = string.lstrip('\n')
    string = string.rstrip('% ABV')
    return string

def removeIBU(string):
    string = string.lstrip('\n')
    string = string.rstrip(' IBU')
    return string


beer_df['abv'] = beer_df['abv'].apply(removeABV)
beer_df['ibu'] = beer_df['ibu'].apply(removeIBU)
beer_df.head()

| | name | type | abv | ibu |
|---|---|---|---|---|
| 0 | Humulus Nimbus | Pale Ale - American | 6.0 | 53 |
| 1 | Proliferous | IPA - Imperial / Double | 8.3 | 85 |
| 2 | The Scientist | IPA - American | 7.0 | 75 |
| 3 | Seventh Son American Strong Ale | Strong Ale - American | 7.7 | 40 |
| 4 | Stone Fort Oat Brown | Brown Ale - English | 5.25 | 21 |
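
An alternative to stripping characters is to pull the number out with a regular expression, which also gives a numeric value instead of a string. This is only a sketch: `parse_number` is a hypothetical helper, and the `"\nN/A IBU"` input is an assumed example of a missing value, not something taken from the saved file.

```python
import re

def parse_number(raw):
    """Return the first number found in raw as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None

print(parse_number("\n6% ABV"))
print(parse_number("\n5.25% ABV"))
print(parse_number("\n53 IBU"))
print(parse_number("\nN/A IBU"))
```

With the dataframe from above, this could be applied to the raw scraped columns with something like `beer_df['abv'].apply(parse_number)`.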


In [22]:
# How many of each type of beer?


beer_df.type.value_counts()





IPA - American                          17
Saison / Farmhouse Ale                  11
IPA - Imperial / Double                  9
Pale Ale - American                      7
IPA - New England                        6
IPA - Brut                               4
Stout - American Imperial / Double       4
Smoked Beer                              4
Pale Ale - New England                   3
IPA - Black / Cascadian Dark Ale         3
Stout - Imperial / Double                3
Sour - Farmhouse IPA                     3
Blonde Ale - Belgian Blonde / Golden     2
Lager - Euro                             2
Barleywine - American                    2
Bière de Garde                           2
Lager - IPL (India Pale Lager)           2
Stout - American                         2
Sour - Fruited                           1
Strong Ale - American                    1
Pale Ale - Belgian                       1
Witbier                                  1
Brown Ale - American                     1
IPA - Belgi

Great! We're finally getting familiar enough with BeautifulSoup to move on to an actual website.

### Surfing the web

Here's our hypothetical project. You've been hired by someone who wants to start a FiveThirtyEight-like website but hates writing. Their goal is to create a bot that uses an NLP algorithm to generate new 538-like articles from previous 538 articles. They're too busy working on the algorithm, so they've outsourced the job of scraping the article content to us.

Their desired output is a compilation of 538's articles. The data they need is each article's title, author, and text.

Let's go through how to get the title, author, and text for one specific article.

In [23]:
# The urlopen function lets us request the html code from 538
from urllib.request import urlopen

In [24]:
# Grab the html from the article
html = urlopen("https://fivethirtyeight.com/features/giannis-antetokounmpo-is-creating-more-than-ever/")
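
Some sites reject requests that do not look like they come from a browser, and a request can hang if the server is slow. The sketch below shows one way to set a `User-Agent` header and a timeout with the standard library. The helper names `build_request` and `fetch_html` and the user agent string are assumptions made for this example; whether a given site needs them is something you would have to check.

```python
from urllib.request import Request, urlopen

def build_request(url, user_agent="Mozilla/5.0 (tutorial scraper)"):
    """Build a request object that carries a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, giving up after timeout seconds."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network
request = build_request("https://example.com/")
print(request.full_url)
```

The bytes returned by `fetch_html` can be handed to `BeautifulSoup` in the same way as the `urlopen` result.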

In [25]:
# Turn it into soup
soup = BeautifulSoup(html,"html.parser")

In [26]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Giannis Antetokounmpo Is Creating More Than Ever | FiveThirtyEight
  </title>
  <!-- Jetpack Site Verification Tags -->
  <link href="//cdn.registerdisney.go.com" rel="dns-prefetch"/>
  <link href="//platform.twitter.com" rel="dns-prefetch"/>
  <link href="//s.w.org" rel="dns-prefetch"/>
  <link href="//pagead2.googlesyndication.com" rel="dns-prefetch"/>
  <link href="//tpc.googlesyndication.com" rel="dns-prefetch"/>
  <link href="//securepubads.g.doubleclick.net" rel="dns-prefetch"/>
  <link href="//www.googletagmanager.com" rel="dns-prefetch"/>
  <link href="//www.googletagservices.com" rel="dns-prefetch"/>
  <link href="//www.googleadservices.com" rel="dns-prefetch"/>
  <link href="//adservice.google.com" rel="dns-prefetch"/>
  <link href="//www.google.com" rel="dns-prefetch"/>
  <link href="https://fivethirtyeight

As we can see, reading through the raw code to find what we want would be tedious. Let's learn about the web developer tools of your web browser, which let you inspect the element behind any part of the page.


#### Your Turn

Now that we've reviewed how to use the developer tools, try writing the code that will grab the desired info.

In [28]:
# Find the author here
author_name = soup.find("p",{'class':"single-metadata single-byline vcard"}).text

print(author_name)
print()

# Remove the "By" (keeps every word after it, so longer names still work)
print(" ".join(author_name.split()[1:]))






By Jared Dubin

Jared Dubin


In [29]:
# Find the title here

article_title = soup.find("h1",{'class':"article-title article-title-single entry-title"}).text

article_title.strip('\n\t')






'Giannis Antetokounmpo Is Creating More Than Ever'

In [34]:
# Find the text of the article here

article_content = soup.find('div',{'class':"entry-content single-post-content"})

article_paragraphs = [p.text for p in article_content.find_all("p")]

for paragraph in article_paragraphs:
    print(paragraph)
    print()




When the Milwaukee Bucks drafted Giannis Antetokounmpo with the No. 15 overall pick in the 2013 NBA Draft, so little was known about the international man of mystery that draft reaction around the media landscape had two different spellings of his name: Antetokounmpo and Adetokunbo.1 Then 6-foot-9 after growing 3 inches in the year leading up to the draft, Antetokounmpo was rail-thin at 196 pounds, and it seemed that the Bucks had swung for the fences with one of the highest-variance picks in recent memory. He was compared to everyone from Thabo Sefolosha to Scottie Pippen and Nicolas Batum to Kevin Durant. Seriously.

Nearly seven years later, it’s difficult to argue that Antetokounmpo hasn’t maxed out his potential. But that doesn’t mean those post-draft prognostications were wrong. It did take Giannis a few years to make it clear that the Bucks had really hit a home run with his selection, and a few more for him to blossom into a full-fledged force capable of dominating every single
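
Since the goal is a compilation of many articles, the three steps above can be wrapped in one function. The following is a sketch under assumptions: `scrape_article` is a hypothetical helper, it relies on the same class names used in the cells above (which may not hold for every article or after a site redesign), and it is demonstrated on a made-up HTML string instead of a live page.

```python
from bs4 import BeautifulSoup

def scrape_article(html):
    """Return a dict with the title, author, and text of one article's html."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", {"class": "article-title article-title-single entry-title"})
    byline = soup.find("p", {"class": "single-metadata single-byline vcard"})
    content = soup.find("div", {"class": "entry-content single-post-content"})

    author = None
    if byline is not None:
        words = byline.get_text().split()
        # Drop the leading "By" if it is there
        if words and words[0] == "By":
            words = words[1:]
        author = " ".join(words)

    paragraphs = [p.get_text() for p in content.find_all("p")] if content is not None else []
    return {
        "title": title.get_text(strip=True) if title is not None else None,
        "author": author,
        "text": "\n\n".join(paragraphs),
    }

# Made-up html that reuses the class names from the cells above
sample = """<html><body>
<h1 class="article-title article-title-single entry-title">
	A Sample Title
</h1>
<p class="single-metadata single-byline vcard">By Jane Doe</p>
<div class="entry-content single-post-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

result = scrape_article(sample)
print(result)
```

Calling this function on the html of each article and collecting the dictionaries in a list gives data that `pd.DataFrame` can turn into a table.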

Now you've got a good introduction to web scraping with `BeautifulSoup`. If you want to dive deeper, check out the documentation here, <a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/">https://www.crummy.com/software/BeautifulSoup/bs4/doc/</a>, or try searching the web for what you'd like to do.