## What You'll Accomplish in this Notebook

Here's what you'll do in this notebook:
<ul>
    <li>learn about the structure of an HTML page</li>
    <li>be introduced to the BeautifulSoup package</li>
    <li>see how to parse HTML code with a toy example</li>
    <li>scrape data on Seventh Son beers from some saved HTML code</li>
    <li>download then scrape an actual webpage</li>
</ul>

In [None]:
# Our common data handling package
import pandas as pd

# HTML Scraping With `BeautifulSoup`

In this notebook we'll learn how to scrape HTML files with <a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup</a>. Try running the code below.

In [None]:
from bs4 import BeautifulSoup

If that did not work, take a moment to install the package, for example by running `pip install beautifulsoup4` in a terminal.

### Understanding the Structure of an HTML Page

`BeautifulSoup` takes in an HTML document and parses it so that you can extract the information you want. To see what that means, let's look at a toy example of a webpage. To see how the HTML snippet below looks in a web browser, open <a href="SampleHTML.html">SampleHTML.html</a>.

In [None]:
# This is an html chunk
# It has a head and a body, just like you
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

We can now use `BeautifulSoup` to parse this simple html chunk.

In [None]:
# Now we make a BeautifulSoup object out of the html code
# The first input is the html code
# The second input is the parser that
# BeautifulSoup should use to read the code
soup = BeautifulSoup(html_doc,'html.parser')

In [None]:
# Let's use the prettify method to make our html pretty and see what it has to say
# Ideally this is how someone writing pure html code would write their code
print(soup.prettify())

HTML files have a natural tree structure that we'll briefly cover now. Here's the tree of our sample HTML:

<img src = "html_tree.png" width = "500"></img>

Each level in the tree represents a 'generation' of the HTML code. For instance, the body has three p children, and the leftmost p has one b child. `BeautifulSoup` helps us traverse these trees to gather the data we want.
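
As a small sketch of what 'children' and 'parents' mean here, the snippet below parses a cut-down copy of the sample page and walks one level down and then back up the tree. The names `tiny_doc` and `tiny_soup` and the shortened story text are made up for illustration.

```python
from bs4 import BeautifulSoup

# A cut-down, hypothetical version of the sample page
tiny_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
</body></html>
"""

tiny_soup = BeautifulSoup(tiny_doc, "html.parser")

# recursive=False restricts find_all to direct children,
# i.e. one level down the tree; True matches any tag
body_children = [child.name for child in tiny_soup.body.find_all(True, recursive=False)]
print(body_children)

# The first p has a single tag child, the b
first_p_children = [child.name for child in tiny_soup.body.p.find_all(True, recursive=False)]
print(first_p_children)

# Going back up the tree: the b's parent is a p, whose parent is the body
print(tiny_soup.b.parent.name, tiny_soup.b.parent.parent.name)
```

Dropping `recursive=False` would also return grandchildren such as the b, which is why it is used here to show a single generation.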

We'll now traverse this html sapling.

In [None]:
# Below are some examples of BeautifulSoup methods and
# attributes that help us better understand the structure
# of html code

# What is the title of the page?
print(soup.title)
print()

# Notice we can also get the title like so
print(soup.head.title)
print()

# What if I just want the text from the title?
print(soup.title.text)
print()

# What html structure is the title's parent?
print(soup.title.parent.name)
print(soup.title.parent)
print()

# What is the first a of the html document?
print(soup.a)

# What is the first a's class?
print(soup.a['class'])
print()

# There are multiple a's. Can I find all of them?
print(soup.find_all('a'))
for a in soup.find_all('a'):
    print()
    print(a['class'], a.text)
    

#### Practice

In [None]:
# Now you practice!

# Find the first p of the document
# What is the first p's class? What string is in that p?







In [None]:
# For all of the a's in the document find their href









Now that we've got some experience, let's move on to some slightly more advanced parsing.

### Now We're of Drinking Age

I've included in this repository some HTML code saved from an <a href = "https://untappd.com/home">Untappd</a> search. I went to Untappd, found the <a href = "https://www.seventhsonbrewing.com">Seventh Son</a> page, clicked on their beer list, and saved only the HTML code from the results. You can see the file here: <a href="SeventhSon.html">SeventhSon.html</a>. We can read in that file with the following code.

In [None]:
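# This reads the html file's code into a string so we can parse it
# The with block closes the file once it has been read
with open("SeventhSon.html", 'r') as html_file:
    seventh_son_beer_search = html_file.read()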

In [None]:
# You write code here to make a soup object of our code
# Sample Answer
soup = BeautifulSoup(seventh_son_beer_search,"html.parser")






In [None]:
# Look at the code using prettify
# Sample Answer
print(soup.prettify())



As we can see from the `prettify()` output, this HTML code is more complicated than our toy example from above, but `BeautifulSoup` is able to handle it all the same. Let's write some code to go through the HTML, grab the beer names, and store those names in a list.

To do that, let's learn a little more about BeautifulSoup's functionality. Looking at the `prettify()` output, we see that each beer is contained in a `div` with class "beer-item". We can use that class information to our advantage.
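
As a rough sketch of filtering by class, the snippet below runs three equivalent queries on a tiny made-up fragment. The `snippet` markup and beer names are hypothetical, and only the "beer-item" and "name" class names come from the saved page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on a list of beers
snippet = """
<div class="beer-item"><p class="name">Beer A</p></div>
<div class="other-item"><p class="name">Not a beer</p></div>
<div class="beer-item"><p class="name">Beer B</p></div>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# Three ways to ask for every div whose class is "beer-item"
by_dict = demo_soup.find_all("div", {"class": "beer-item"})
by_keyword = demo_soup.find_all("div", class_="beer-item")
by_selector = demo_soup.select("div.beer-item")

print(len(by_dict), len(by_keyword), len(by_selector))

# Each result is itself searchable, so we can dig out the name
demo_names = [item.find("p", {"class": "name"}).text for item in by_dict]
print(demo_names)
```

The keyword form is spelled `class_` because `class` is a reserved word in Python.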

In [None]:
cooler = soup.find_all('div',{'class':"beer-item"})

In [None]:
beer0 = cooler[0]

In [None]:
beer0

In [None]:
# Now use what we just learned to extract the name from beer 0
# The name is contained in a p element with class "name"
print(beer0.find("p",{'class':"name"}))






In [None]:
# Now make a list of all the beer names
beers = [beer.find("p",{'class':"name"}).text for beer in cooler]






In [None]:
beers

#### Practice

You've been hired by a competitor to Seventh Son. They want a dataframe of all of Seventh Son's beers that includes each beer's name, type, ABV, and IBU. Use BeautifulSoup to give them this.
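
One possible shape for this kind of task, sketched on invented markup: collect one dictionary per item, then hand the list to pandas. The class names "style", "abv", and "ibu" below are assumptions for illustration and may not match what `SeventhSon.html` actually uses, so check the `prettify()` output for the real ones. The helper `text_or_none` is also hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names "style", "abv" and "ibu" are
# assumptions for this sketch, not taken from SeventhSon.html
sample = """
<div class="beer-item">
  <p class="name">Beer A</p><p class="style">IPA - American</p>
  <div class="abv">7.0% ABV</div><div class="ibu">70 IBU</div>
</div>
<div class="beer-item">
  <p class="name">Beer B</p><p class="style">Stout - Oatmeal</p>
  <div class="abv">5.5% ABV</div><div class="ibu">N/A IBU</div>
</div>
"""
sample_soup = BeautifulSoup(sample, "html.parser")


def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first match, or None if it is missing."""
    found = parent.find(tag, {"class": class_name})
    return found.text.strip() if found is not None else None


# Build one dictionary per beer
rows = []
for item in sample_soup.find_all("div", {"class": "beer-item"}):
    rows.append({
        "name": text_or_none(item, "p", "name"),
        "type": text_or_none(item, "p", "style"),
        "abv": text_or_none(item, "div", "abv"),
        "ibu": text_or_none(item, "div", "ibu"),
    })

print(rows)
```

Passing `rows` to `pd.DataFrame` gives one row per beer with the dictionary keys as columns. Returning `None` for a missing element keeps one incomplete entry from stopping the whole loop.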

In [None]:
# Your work here










In [None]:
# How many of each type of beer?










Great! We're finally getting familiar enough with BeautifulSoup to move on to an actual website.

### Surfing the web

Here's our hypothetical project. You're hired by someone who wants to start a FiveThirtyEight-like website, but hates writing. Their goal is to create a natural language bot that uses an NLP algorithm to generate new 538-like articles from previous 538 articles. They're too busy working on the algorithm, so they've outsourced the job of scraping the article content to us.

Their desired output is a compilation of 538's articles. The data they need is each article's title, author, and text.

Let's go through how to get the title, author, and text for one specific article.
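
The same download-then-parse steps will be needed for every article, so it can help to wrap them in a function. Below is a minimal sketch under that assumption. The name `fetch_soup`, its `timeout` default, and the choice to return `None` on failure are all made up for illustration.

```python
from urllib.error import URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download url and return a BeautifulSoup object, or None on failure."""
    try:
        # The with block closes the connection once the page is read
        with urlopen(url, timeout=timeout) as response:
            html = response.read()
    except URLError as err:
        # HTTPError is a subclass of URLError, so bad status codes land here too
        print(f"Could not fetch {url}: {err}")
        return None
    return BeautifulSoup(html, "html.parser")
```

With this helper, `fetch_soup(article_url)` would stand in for the separate `urlopen` and `BeautifulSoup` calls, and a loop over many article URLs could skip any page that comes back as `None`.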

In [None]:
# The urlopen function lets us download the html code from 538
from urllib.request import urlopen

In [None]:
# Grab the html from the article
html = urlopen("https://fivethirtyeight.com/features/giannis-antetokounmpo-is-creating-more-than-ever/")

In [None]:
# Turn it into soup
soup = BeautifulSoup(html,"html.parser")

In [None]:
print(soup.prettify())

We can see that going through the code might be annoying. Let's learn about the web developer tools of your web browser.


#### Your Turn

Now that we've reviewed how to use the developer tools, try writing the code that will grab the desired info.
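
As a hint at the general pattern, here is a sketch on an invented article fragment. Every tag and class name in `article_html` is hypothetical and will not match the real page, so use the developer tools to find the ones that do.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup. The tag and class names are invented for
# this sketch; use your browser's developer tools to find the real ones.
article_html = """
<article>
  <h1 class="article-title">A Made-Up Headline</h1>
  <a class="author" href="#">Jane Doe</a>
  <div class="entry-content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</article>
"""
article_soup = BeautifulSoup(article_html, "html.parser")

# Single elements: find the tag, then take its text
title = article_soup.find("h1", {"class": "article-title"}).text.strip()
author = article_soup.find("a", {"class": "author"}).text.strip()

# Article body: join the paragraphs inside the content div into one string
content = article_soup.find("div", {"class": "entry-content"})
text = "\n".join(p.text.strip() for p in content.find_all("p"))

print(title, author, text, sep="\n")
```

Searching inside `content` instead of the whole soup keeps paragraphs from other parts of a page, such as footers or sidebars, out of the article text.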

In [None]:
# Find the author here







In [None]:
# Find the title here







In [None]:
# Find the text of the article here







Now you've got a good introduction to web scraping with `BeautifulSoup`. If you want to dive deeper, check out the documentation here, <a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/">https://www.crummy.com/software/BeautifulSoup/bs4/doc/</a>, or try searching the web for what you'd like to do.