<a href="https://colab.research.google.com/github/samifriedrich/webscraping_workshop/blob/main/solutions_to_webscraping_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping in python: From HTML Soup to Tidy Data

Thursday, Nov 19, 2020

[BioData Club](https://biodata-club.github.io/) Workshop Series

- Author: Sami Friedrich
- Created: Nov 11, 2020
- Updated: Nov 19, 2020
- Libraries used:
  - `requests`
  - `BeautifulSoup4`
  - `pandas`
- Additional tools used:
  - the browser Inspector/Inspect tool


## Workshop overview
The internet is overflowing with data ripe for harvesting. The challenge is that not all of that data is formatted neatly or easily accessible. Enter the web scraping multitool! With the power of web scraping, the contents of virtually any webpage can be transformed into analysis-ready data. During this workshop, you’ll learn how to use python to:
 
1. Scavenge the contents of an HTML webpage
2. Extract only the data you want
3. Format the data into a table

### Before we get started
1. Some basic python knowledge (looping through list elements, passing arguments to functions, writing basic functions) is a prerequisite for this workshop. 
  - If you are new to python or want to brush up on these topics before the workshop, check out these free tutorials:
    - http://introtopython.org/introducing_functions.html
    - http://introtopython.org/lists_tuples.html#Lists-and-Looping
2. We will also be working with HTML; no prior experience is necessary. However, it will be helpful to have a surface-level understanding of HTML elements - namely, their open/close tag structure, and how they nest within each other.
  - If you are not familiar with HTML elements or tags, please take a look at [this short overview on HTML Basics](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) before beginning.

## Intro to Jupyter Notebooks
We'll be working through exercises in a [Jupyter Notebook](https://jupyter.org/) hosted on Google Colab. A Jupyter Notebook is a coding environment that contains two types of cells - cells with narrative/explanation, like this one you're reading now, and cells with executable code, like the next cell down. Run the code in the cell below by clicking on the cell and pressing the "play" button to the left of the cell, or by pressing `SHIFT+ENTER`.

In [55]:
# This is a comment and will not affect the output
# Run this code cell
message = "Welcome to web scraping"
print(message)

Welcome to web scraping


The output of a code cell is displayed below it - in this case, the output is the printed string `Welcome to web scraping`. 

### A note on printing vs displaying in Notebooks
Usually in python environments you need to use the `print()` function to see the contents of a variable. One nice feature of Jupyter Notebooks is a shortcut where you simply type the name of the variable to display it below the code cell. 

### Exercise: Display a variable in a Jupyter Notebook
In the cell below, type the variable `message` on the second line and run the cell.

In [56]:
message = "Welcome to web scraping"
# Write your code here
message

'Welcome to web scraping'

That's all you need to know about notebooks for today, but head over to [Jupyter Project](https://jupyter.org/) to learn more.

# Step 1: Loading the webpage HTML into python

## 1.1 Sending a GET request

First, we need to get a webpage's HTML into our python environment. We can do that using the `requests` library.

Whenever you type a URL into your browser bar or click a link, your browser makes what is called a `GET` request to the web server hosting the webpage. What your browser receives back from the server is the information and resources to load that webpage, including the HTML, which it then renders in your browser window.

We can use the `requests` library to send `GET` requests from within python.

### Exercise 1.1.1: Make a GET request using the `requests` library
Run the code block below by hitting the "Run" button, or pressing SHIFT+RETURN.

In [57]:
# Run this code cell
import requests
url = "http://dataquestio.github.io/web-scraping-pages/simple.html"
page = requests.get(url)
page

<Response [200]>

What we get back is a Response object, which, when displayed, shows a `status_code` that tells us how our request was handled. A status code of 200 means everything went smoothly and the webpage data was received. 

You're probably familiar with error codes like "404" - the "not found error." We won't get into these other codes today but you can read more about them [here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
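If you'd like a quick way to look up what a status code means without leaving python, here is a minimal sketch using the standard library's `http.HTTPStatus`. The helper name `is_success` is made up for this illustration; with a real response you would pass it the integer stored in `page.status_code`.

```python
from http import HTTPStatus

# Look up the standard phrase for a few common status codes
for code in (200, 301, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)

# Hypothetical helper: codes in the 200s signal that the request succeeded
def is_success(code):
    return 200 <= code < 300

print(is_success(200))
print(is_success(404))
```
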

### Exercise 1.1.2: Extract the HTML from the `page` object
Take a look at the attributes and methods available on the `page` variable by typing `page.` and scrolling through the list in the pop-up box. **Pay particular attention to the entries with the blue box symbol.** Try one of these on the `page` variable. Experiment until you find one that outputs something that looks like the page's HTML, and assign the output to a variable named `contents`.

*Hint: The HTML should have the word "doctype" towards the top*

In [58]:
# Experiment here by trying out different methods on the page object
# Write your code here
contents = page.content
contents

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

Now we have the tools we need from the `requests` library to retrieve a webpage, extract the HTML, and save it as a variable in our python environment.
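One detail worth noticing in the output above is the leading `b'`: `page.content` hands back raw `bytes`, not a regular python string. The sketch below uses an abbreviated copy of the simple example page and assumes it is UTF-8 encoded. It shows how bytes can be decoded into a `str`; the `requests` library also offers `page.text`, which returns an already-decoded string.

```python
# An abbreviated copy of the bytes returned for the simple example page
raw = b"<!DOCTYPE html>\n<html>\n<body>\n<p>Here is some simple content for this page.</p>\n</body>\n</html>"

print(type(raw))

# Decode the bytes into a regular string (assuming UTF-8 encoding)
html_text = raw.decode("utf-8")

print(type(html_text))
print(html_text)
```
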

## 1.2 Our business case: "One of your favorite beers is on tap!"

Let's pretend you're on a developer team that is working to build an app for beer enthusiasts. The team wants to design a feature that sends a notification to the user whenever a beer from the user's favorite beer list pops up on tap around town. It's your job as the PDX Data Wrangler to gather data about what's on tap at bars around Portland.

To achieve this goal, the data we need to capture for each bar is:
- the name of the beers on tap
- the name of the breweries for those beers

We'll start by choosing one bar to build our scraper around: Belmont Station. 

It just so happens that Belmont Station uses a service, TapHunter, to display their rotating tap list on their [bar's website](https://www.belmont-station.com/on-tap). This tap list is also available on [TapHunter](https://www.taphunter.com/location/belmont-station/6549318358269952), as are tap lists from other bars. 

We could scrape either the bar's page, or the TapHunter page. The benefit of designing our web scraper around the TapHunter page is that we're able to **reuse our code to scrape data for multiple bars** from TapHunter. 


### Exercise 1.2: Load in the HTML for Belmont Station's TapHunter page

To avoid overloading the web servers over at TapHunter and ensure we're all looking at the exact same HTML, I downloaded the HTML of Belmont Station's TapHunter page this morning (Nov 19), then uploaded that .html file to a GitHub repository. Because this .html file is still a webpage (now hosted on GitHub servers), we can use the same approach we did above to retrieve its HTML. 

Using the `requests` library method we learned above, retrieve the webpage from the URL below, and save the HTML content to a variable named `contents`. Display the `contents` variable by typing it on the last line of your code block, then run the code.

In [59]:
# Retrieve the web content from the url below and save the HTML to the variable "contents"
url = "https://raw.githubusercontent.com/samifriedrich/webscraping_workshop/main/taphunter_belmont_station.html"
# Write your code here
page = requests.get(url)
contents = page.content
contents

b'b\'\\n\\n\\n\\n\\n\\n<!DOCTYPE html>\\n<html lang="en"><head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# website: http://ogp.me/ns/website#"><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width, initial-scale=1"><meta property="fb:app_id" content="132562649599" /><meta name="twitter:app:country" content="US" /><meta name="twitter:app:name:iphone" content="TapHunter - Find Beer, Spirits, & More" /><meta name="twitter:app:id:iphone" content="512023104" /><meta name="twitter:app:name:googleplay" content="TapHunter" /><meta name="twitter:app:id:googleplay" content="com.taphunter.webbased" /><meta itemprop="market" content="annarbor" /><meta name="description" content="Live drink menu of Belmont Station - including Russian River Pliny The Elder, 2 Towns Ciderhouse Easy Squeezy, and Barley Brown&#39;s Pallet Jack IPA"><link rel="canonical" href="https://www.taphunter.com/location/belmont-station/654931835826

As it stands, the `contents` variable we created is one long, unwieldy stream of characters. (Strictly speaking, `page.content` returns a `bytes` object rather than a `str`, which is why the output starts with `b'`.) We could try to parse it as is, but why brute force it when there are handy pre-built tools to help us make sense of this soup?

Enter the `BeautifulSoup` library!

## 1.3 Time to make soup

Our next step is to transform the long unwieldy HTML string contained in our `contents` variable into a `BeautifulSoup` object. 

We first import the `BeautifulSoup` data structure from the bs4 (BeautifulSoup4) library. Run the code block below to do this.

In [60]:
# Run this code cell
from bs4 import BeautifulSoup

### Upgrading `contents` from string to soup

With the BeautifulSoup library loaded, next we "pour" our `contents` variable into a BeautifulSoup object, thus creating a new variable. Under the hood, BeautifulSoup will parse the `contents` variable, devising its structure based on rules of HTML, and making it a much more accessible and interactive object.

Though it may seem confusing at first, BeautifulSoup is the name of a web scraping library, but it is also the name of a constructor function, `BeautifulSoup()`, within that library. 

You can think of the `BeautifulSoup()` constructor as a function that puts a wrapper around whatever object we pass it. Our original `contents` variable remains intact; what we get back is a new BeautifulSoup object that is decorated and **supercharged with methods** beyond what base python offers us. These are the benefits of upgrading `contents` from a plain string of HTML to a `BeautifulSoup` type object.

We create a BeautifulSoup object by passing `contents` to the `BeautifulSoup()` constructor function, and saving the output to a new variable.

Because BeautifulSoup can parse other kinds of documents, such as XML, we also need to pass in one additional argument, the string `"html.parser"`, to tell `BeautifulSoup()` to parse our string as HTML.

Here's an example of how to use the `BeautifulSoup()` constructor:
```python
bs_object = BeautifulSoup(html_var, "html.parser")
```
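
To make that concrete, here is a small, self-contained sketch that runs the constructor on the HTML of the simple example page from Step 1.1. The variable names `html_var` and `bs_object` follow the snippet above and are placeholders, not names used elsewhere in the workshop.

```python
from bs4 import BeautifulSoup

# HTML of the simple example page, stored as a plain string
html_var = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

bs_object = BeautifulSoup(html_var, "html.parser")

print(type(html_var))           # the original string is untouched
print(type(bs_object))          # the new object is a BeautifulSoup
print(bs_object.title)          # tags can now be reached by name
print(bs_object.p.get_text())   # and their text can be pulled out
```
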
### Exercise 1.3: Creating a BeautifulSoup object

Call the `BeautifulSoup()` constructor function on the `contents` variable, and pass the string `"html.parser"` as the second argument to the function. Save this to a variable named `soup`, and display `soup`.

In [61]:
# Transform the contents variable into a BeautifulSoup object 
# Write your code here
soup = BeautifulSoup(contents, "html.parser")
soup

b'\n\n\n\n\n\n<!DOCTYPE html>
\n<html lang="en"><head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# website: http://ogp.me/ns/website#"><meta charset="utf-8"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="width=device-width, initial-scale=1" name="viewport"/><meta content="132562649599" property="fb:app_id"><meta content="US" name="twitter:app:country"><meta content="TapHunter - Find Beer, Spirits, &amp; More" name="twitter:app:name:iphone"><meta content="512023104" name="twitter:app:id:iphone"/><meta content="TapHunter" name="twitter:app:name:googleplay"/><meta content="com.taphunter.webbased" name="twitter:app:id:googleplay"/><meta content="annarbor" itemprop="market"/><meta content="Live drink menu of Belmont Station - including Russian River Pliny The Elder, 2 Towns Ciderhouse Easy Squeezy, and Barley Brown's Pallet Jack IPA" name="description"/><link href="https://www.taphunter.com/location/belmont-station/6549318358269952" rel="canonical"/><meta co

## 1.4 Printing pretty with `.prettify()`

With our new, supercharged BeautifulSoup object, `soup`, we can use methods that didn't exist for the `contents` string-type variable.

One great feature of a BeautifulSoup object is its `.prettify()` method, which formats a BeautifulSoup object with indentation, making it easier to read and see how HTML elements are nested. 

### Exercise 1.4: Print some pretty soup
Try using the `.prettify()` method on our `soup` BeautifulSoup object. 

This method **requires open and closed parentheses** after it.

To see the effects of this method, the "prettified" object **must be printed**.

In [62]:
# Use the .prettify method on soup and print the result
# Write your code here
print(soup.prettify())

b'\n\n\n\n\n\n
<!DOCTYPE html>
\n
<html lang="en">
 <head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# website: http://ogp.me/ns/website#">
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="132562649599" property="fb:app_id">
   <meta content="US" name="twitter:app:country">
    <meta content="TapHunter - Find Beer, Spirits, &amp; More" name="twitter:app:name:iphone">
     <meta content="512023104" name="twitter:app:id:iphone"/>
     <meta content="TapHunter" name="twitter:app:name:googleplay"/>
     <meta content="com.taphunter.webbased" name="twitter:app:id:googleplay"/>
     <meta content="annarbor" itemprop="market"/>
     <meta content="Live drink menu of Belmont Station - including Russian River Pliny The Elder, 2 Towns Ciderhouse Easy Squeezy, and Barley Brown's Pallet Jack IPA" name="description"/>
     <link href="https://www.taphunter.com/locatio

# Step 2: Straining the data from the soup

Congrats! We have successfully transferred a webpage's HTML into a pythonic BeautifulSoup object.

The next step is to strain the soup - that is, extract the data we're interested in. 

As we move through this section, keep in mind that there are multiple ways to go about extracting the data we want, not just the one outlined here. If another route makes more sense to you, do that!

There are lots of ways to navigate the BeautifulSoup object, including moving linearly through it one element at a time, much like iterating over every element in a list. There are also shortcuts to navigate to specific elements if we already have an idea of what we're looking for.

Today, we're going to focus on how to pull out specific elements using a few BeautifulSoup methods: `.find()` and `.find_all()`.

## 2.1 The CTRL+F of BeautifulSoup: `.find_all()`

`.find_all()` is probably the method you'll use most often when extracting data from a BeautifulSoup object. This method searches the BeautifulSoup object for an element or string, and returns the results as a ResultSet object. We can treat the ResultSet object just like a regular python list that has some bonus methods attached to it.

#### A note on `.find()` vs `.find_all()`:

BeautifulSoup also has a `.find()` method. The main difference is that `.find()` returns only the first search result as a single element, while `.find_all()` returns all results as a list of elements. 
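
Here is a minimal sketch of that difference on a tiny, made-up HTML snippet (the list of beer styles and the name `demo_soup` are just for illustration). It also shows what each method gives back when nothing matches.

```python
from bs4 import BeautifulSoup

html = "<ul><li>Pilsner</li><li>Stout</li><li>Saison</li></ul>"
demo_soup = BeautifulSoup(html, "html.parser")

first_item = demo_soup.find("li")       # a single element
all_items = demo_soup.find_all("li")    # a list-like ResultSet of elements

print(first_item)
print(len(all_items))
print(all_items[1])

# When nothing matches, .find() returns None and .find_all() returns an empty result
print(demo_soup.find("table"))
print(len(demo_soup.find_all("table")))
```
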

#### Searching for HTML tags

The `.find_all()` method is quite flexible in terms of what it can search for, but one common way to use `.find_all()` is to extract all instances of a given HTML tag type (`<body>`, `<p>`, `<div>`, etc.). 

For example, this code would extract all the `<div>` elements:

```python
soup.find_all("div")
```
When using this approach, you do **not** need to include the `<>` characters - just pass in the tag label. 

### Exercise 2.1: Extracting all hyperlink tags
Use the `.find_all()` method to extract all the hyperlinks from our BeautifulSoup object `soup`. Assign the result to `hyperlinks` and display it. 

*Hint: Hyperlinks are denoted by `<a>` tags.*

In [63]:
# Extract all hyperlink tags and print the result
# Write your code here
hyperlinks = soup.find_all("a")
hyperlinks

[<a class="btn btn-primary gtm-link" href="/get-listed/start?promo_code=covid60">Get Started!</a>,
 <a class="navbar-brand" href="/location/"></a>,
 <a href="/search/?type=locations&amp;near=Ann%20Arbor%2C%20Michigan">Places</a>,
 <a href="/search/?type=beers&amp;near=Ann%20Arbor%2C%20Michigan">Beer</a>,
 <a href="/search/?type=breweries&amp;near=Ann%20Arbor%2C%20Michigan">Breweries</a>,
 <a href="/search/?type=wines&amp;near=Ann%20Arbor%2C%20Michigan">Wine</a>,
 <a href="/search/?type=spirits&amp;near=Ann%20Arbor%2C%20Michigan">Spirits</a>,
 <a href="/search/?type=cocktails&amp;near=Ann%20Arbor%2C%20Michigan">Cocktails</a>,
 <a href="/search/?type=events&amp;near=Ann%20Arbor%2C%20Michigan">Events</a>,
 <a href="/u/login/">Log In</a>,
 <a href="/u/signup/">Sign Up</a>,
 <a class="btn btn-block btn-primary gtm-location-event" data-gtm-label="orderonline" href="https://www.belmont-station.com/bottles" target="_blank"><span class="fa fa-shopping-cart"></span> \n\t\t\t\t\t\t\t\t\t\tOrder O

We can use this same `.find_all("tag")` method to extract all elements of any tag type.

## 2.2 Extracting tags with specific attributes or classes

HTML opening tags often contain more than just the tag name itself. These extra bits are called attributes, and one especially common attribute is `class`. 
- Attributes follow the tag name and often have an X=Y structure, e.g. `<h1 lang="en">`. 
- Classes follow the same structure, e.g. `<h1 class="breaking-headline">`. A class is a label that points to styling rules defined elsewhere (usually in a stylesheet), so it serves as shorthand for a whole set of styles. 

I like to think of tag classes as wrappers that can be applied to HTML tags to style the elements they contain.

In addition to passing in HTML tag types to `.find_all()`, we can also specify attributes or classes associated with tags to further filter our results.

For example, we may have 3 different flavors of `<div>` tags throughout our HTML:
- `<div class="strawberry">`
- `<div class="chocolate">`
- `<div class="pistacchio">`

We're only interested in the pistacchio `<div>` tags. If we run `.find_all('div')`, we will get back all the `<div>` tags, including the strawberry and chocolate ones. 

To return only the "pistacchio" class `<div>` tags, we pass an additional argument to `.find_all()` - the `class_` argument. 

#### **Notice the underscore in this `class_` parameter, which distinguishes it from `class`, a reserved word in base python.**

To get just the "pistacchio" `<div>` elements, we can run:
```python
soup.find_all("div", class_="pistacchio")
```

which would return only the `<div class="pistacchio">` tags.
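
A runnable version of the ice cream example might look like the sketch below. The HTML and the name `flavor_soup` are invented for this illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="strawberry">one</div>
<div class="chocolate">two</div>
<div class="pistacchio">three</div>
<div class="pistacchio">four</div>
"""
flavor_soup = BeautifulSoup(html, "html.parser")

# Without a class filter we get every <div>
all_divs = flavor_soup.find_all("div")
print(len(all_divs))

# With class_ we only get the <div> tags of that class
pistacchio_divs = flavor_soup.find_all("div", class_="pistacchio")
print(pistacchio_divs)
```
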

### Exercise 2.2: Extract all hyperlinks of a specific class

Extract all hyperlink tags from `soup` that have the class "gtm-link", and assign to a variable named "a_gtm_link". Display the `a_gtm_link` variable.

In [64]:
# Extract all hyperlink elements of class "gtm-link" and save to variable a_gtm_link
# Write your code here
a_gtm_link = soup.find_all("a", class_="gtm-link")
a_gtm_link

[<a class="btn btn-primary gtm-link" href="/get-listed/start?promo_code=covid60">Get Started!</a>,
 <a class="btn btn-sm btn-primary gtm-link" href="/get-listed/start?promo_code=covid60">Get Started!</a>,
 <a class="gtm-link" href="https://www.evergreenhq.com/products/digital-drink-menu/">Digital Menus</a>,
 <a class="gtm-link" href="https://www.evergreenhq.com/products/print-menu/">Print Menus</a>,
 <a class="gtm-link" href="https://www.evergreenhq.com/products/inventory/">Inventory</a>,
 <a class="gtm-link" href="https://www.evergreenhq.com/products/social-media-tools/">Social Media</a>,
 <a class="gtm-link" href="https://www.evergreenhq.com/products/pos-integration/">POS Integration</a>,
 <a class="gtm-link" href="http://ad.apps.fm/hXanksBKeRFzCmsY1rO4gPE7og6fuV2oOMeOQdRqrE12gTZftEA26pyYTBQeK1ul15LvB3mAIVv-2blyDKgqKIV4EkBzhuiTQ6_P48Dm81I" target="_blank">Download on the AppStore</a>,
 <a class="gtm-link" href="http://market.android.com/details?id=com.taphunter.webbased" target="_bla

Notice that the first element has multiple classes: `class="btn btn-primary gtm-link"`, one of which matches our `class_` parameter "gtm-link." When you search for a tag that matches a certain class, **you’re matching against any of its classes.**

There are BeautifulSoup methods that let us search for elements that match multiple classes and much, much more, but we won't cover that today. If you're interested in this powerful method, check out the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) on selecting CSS classes.
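
For the curious, here is a brief sketch of the difference. The three links are a made-up snippet loosely modeled on the output above. `.find_all()` with `class_` matches a tag if any one of its classes matches, while the `.select()` method takes a CSS selector such as `a.btn.gtm-link`, which only matches tags that carry both classes.

```python
from bs4 import BeautifulSoup

html = """
<a class="btn btn-primary gtm-link" href="/get-listed/start">Get Started!</a>
<a class="gtm-link" href="/products/print-menu/">Print Menus</a>
<a class="btn" href="/u/login/">Log In</a>
"""
link_soup = BeautifulSoup(html, "html.parser")

# Matches any <a> that has "gtm-link" among its classes
any_gtm = link_soup.find_all("a", class_="gtm-link")
print(len(any_gtm))

# Matches only <a> tags that have BOTH "btn" and "gtm-link"
btn_and_gtm = link_soup.select("a.btn.gtm-link")
print(btn_and_gtm)
```
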

## 2.3 Locating target data within the HTML

Now we understand how to use the `.find_all()` method to extract specific elements based on their tag type and class. The next step is to apply this method to extract the data we want.

To extract our target data (the name of the beer and brewery), we first have to figure out **where that data lives in the webpage's HTML**. There are lots of ways to go about this. We could experiment with different searches using `.find_all()`, or we could open and `CTRL+F` the HTML text file in a word processor. 

Personally, I prefer to use an easy and interactive method that makes use of your browser's built-in **inspector** tool.

## The browser inspector
A **browser inspector** is a tool included in most browsers that lets you interactively examine the HTML code underlying webpage elements. What makes using the inspector intuitive is that when you right click on a web page element and choose "Inspect", it highlights the HTML code corresponding to that element. The inspector will also live-update to highlight corresponding visual elements on the webpage as you mouse over the code. The inspector is an essential component of the web scraping multitool!

### Exercise 2.3 Exploring HTML using a browser inspector
Let's try it now on the TapHunter page for [Belmont Station](https://www.taphunter.com/location/belmont-station/6549318358269952). Open the page in your browser, right click on a beer or brewery, and select "Inspect" to open the inspector window. 

**NOTE:** Keep in mind the tap list may have changed since we downloaded it, but the overall layout of the webpage and HTML will be the same.

While we're poking around using the inspector, let's try to figure out which HTML tag types contain our target data. As a reminder, the data we want to capture is:
- the name of the beer
- the name of the brewery

Where do these data show up on the webpage? **Which HTML tag type contains *both* the name of the beer and the brewery name?** Are there attributes or classes that might help us target those tags for extraction using `.find_all()`?

*Hint: Repeated visual elements will share repeated HTML code structure.*

*Hint: Look for a `<div>` with a named class.*

## 2.4 Extracting data-containing elements

We've now identified where in the HTML our data lives. Using the extraction methods we learned earlier, we can isolate those elements from the soup.

### Exercise 2.4.1 Extracting chunks of data

Now that we know what tag and class contains all the info related to each beer, let's use `.find_all()` to extract only those elements from our `soup` object. Save those elements to a variable named `tap_list`, and display `tap_list`.

In [65]:
# Extract the element containing each beer's info,
# and name the result list tap_list
# Write your code here
tap_list = soup.find_all("div", class_="media-body")
tap_list

[<div class="media-body"><h4 class="media-heading"><a href="/beer/russian-river-pliny-the-elder/71">Russian River Pliny The Elder</a></h4><p class="separated"><span>Double IPA</span><span>8.0% ABV</span><span>100 IBU</span></p><p class="separated"><span>Russian River Brewing Company</span><span>Santa Rosa, CA</span></p><p class="text-muted small twolines"><em>Bright hop aromatics | Bitter finish</em></p><p class="tags"><span class="label label-info" title="Added within the last 48 hours">Fresh</span></p></div>,
 <div class="media-body"><h4 class="media-heading"><a href="/beer/2-towns-ciderhouse-easy-squeezy/6270590062297088">2 Towns Ciderhouse Easy Squeezy</a></h4><p class="separated"><span>Fruit Cider</span><span>5.0% ABV</span></p><p class="separated"><span>2 Towns Ciderhouse</span><span>Corvallis, OR</span></p><p class="text-muted small twolines"><em>Pink lemonade | Citrusy | Refreshing</em></p><p class="tags"></p></div>,
 <div class="media-body"><h4 class="media-heading"><a href="/

### Exercise 2.4.2 Printing extracted elements

Because of the bonus features attached to `tap_list`, we can use some of the same methods on chunks (individual beer entries) in `tap_list` that we used earlier on our `soup` object.

Assign the **third** list element of `tap_list` to a variable named "third_beer_entry", then use `.prettify()` to print it.

*Hint: remember that python uses 0-based indexing*

In [66]:
# Pretty-print the third element of tap_list
# Write your code here
third_beer_entry = tap_list[2]
print(third_beer_entry.prettify())

<div class="media-body">
 <h4 class="media-heading">
  <a href="/beer/barley-browns-pallet-jack-ipa/5064484871733248">
   Barley Brown's Pallet Jack IPA
  </a>
 </h4>
 <p class="separated">
  <span>
   American IPA
  </span>
  <span>
   7.0% ABV
  </span>
  <span>
   70 IBU
  </span>
 </p>
 <p class="separated">
  <span>
   Barley Brown's
  </span>
  <span>
   Baker City, OR
  </span>
 </p>
 <p class="text-muted small twolines">
  <em>
   Pine | Citrus | Tropical fruit
  </em>
 </p>
 <p class="tags">
 </p>
</div>



It's much easier to see the structure of these beer info chunks now!

If we printed each list element (beer entry) in `tap_list`, they would all share a similar structure to what we see above. 

### Exercise 2.4.3: Identifying enclosing tags
Before we move on, take another look at `third_beer_entry` printed above and locate the **tags that *immediately* contain**:
- the beer name
- the brewery name

What kinds of tags are they?

## 2.5 Straining extracted elements

Clearly, there's a lot of extra data in each of these `tap_list` elements that we don't need, like the ABV and type of beer. We need to pull out:
- the beer name
- the brewery name

Because elements extracted from `soup` inherit some of the special methods of BeautifulSoup objects, we can use `.find_all()` on each list element in `tap_list` to further search within that beer entry.

As we discovered when printing the `third_beer_entry` above, the name of the beer sits inside a hyperlink or `<a>` tag. Lucky for us, there is only one hyperlink per beer entry. Because `<a>` is a unique tag within each beer entry:
- we can use the `<a>` tag to extract the name of the beer
- we can use `.find()` instead of `.find_all()` to return only one result

Let's try that now on the `third_beer_entry` variable.

### Exercise 2.5.1: Extracting the beer name

Use `.find()` to extract the first (and only) hyperlink element in `third_beer_entry`. Assign it to a variable named `a_tag`, and print it.

*Hint: hyperlinks use the `<a>` tag.*

In [67]:
# Extract one hyperlink from the third_beer_entry object
# Write your code here
a_tag = third_beer_entry.find("a")
a_tag

<a href="/beer/barley-browns-pallet-jack-ipa/5064484871733248">Barley Brown's Pallet Jack IPA</a>

Once you extract it, check out the data type for `a_tag`:

In [68]:
# Run this code cell
type(a_tag)

bs4.element.Tag

Notice it's not a string, but rather a BeautifulSoup object called a Tag. Just like `third_beer_entry` has extra methods, `a_tag` has special methods inherited from the BeautifulSoup object `soup`. You can learn more about the Tag object [here](BS)

### 2.5.2 Stripping the tags

Now we've reached the smallest unit - the single HTML element - that immediately encloses the name of the beer:

`<a href="/beer/barley-browns-pallet-jack-ipa/5064484871733248">Barley Brown's Pallet Jack IPA</a>`

We used tags as signposts to navigate through our `soup` and `tap_list` objects until we reached our target data. However, we don't want to include the opening and closing tags or attributes in our final data table. We only want the name of the beer, which is the **text** between the opening and closing `<a>` tags, `Barley Brown's Pallet Jack IPA`.

The way to extract only the text from a tag object is the `.get_text()` method, e.g. 
```python
bsTag.get_text()
```
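
As a small illustration, the sketch below builds a made-up `<a>` tag (the beer name and `href` are placeholders) with some stray whitespace around the text. It shows plain `.get_text()`, the optional `strip=True` argument that trims surrounding whitespace, and how an attribute such as `href` can be read with square brackets if you ever need it.

```python
from bs4 import BeautifulSoup

html = '<h4 class="media-heading"><a href="/beer/example-beer/1">  Example Beer </a></h4>'
tag_soup = BeautifulSoup(html, "html.parser")
bsTag = tag_soup.find("a")

print(repr(bsTag.get_text()))             # the text exactly as written in the HTML
print(repr(bsTag.get_text(strip=True)))   # surrounding whitespace trimmed
print(bsTag["href"])                      # attributes are read like dictionary keys
```
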

### Exercise 2.5.2: Retrieving the beer name as a simple string

Use the `.get_text()` method to extract the text from `a_tag`. Assign the result to `beer_name` and display `beer_name`.

In [69]:
# Write code to extract the text from the a_tag object
beer_name = a_tag.get_text()
beer_name

"Barley Brown's Pallet Jack IPA"

## 2.5.3 Combining our extraction steps into a for loop

We've written code to extract the beer name from a *single* beer entry of the list object we generated called `tap_list`. Now it's time to apply the same process to the rest of the entries in `tap_list`.

### Exercise 2.5.3: Putting it all together with loops

Let's write some code to extract all the beer names by looping through our `tap_list`. 

Construct a `for` loop using the code bits we've written above to find the tag containing the beer name and extract just the text. Add a final step in the loop that appends the beer name to the list `beer_names`, which has been initialized below. Finally, display the `beer_names` list.

In [70]:
# Loop through tap_list and extract all the beer names
beer_names = []

for beer in tap_list:
  a_tag = beer.find('a')
  beer_name = a_tag.get_text()
  beer_names.append(beer_name)

beer_names

['Russian River Pliny The Elder',
 '2 Towns Ciderhouse Easy Squeezy',
 "Barley Brown's Pallet Jack IPA",
 'Boneyard Elixir Lemon Ginger CBD',
 'Breakside IPA',
 'Breakside Up Top!',
 'Double Mountain Killer Juicy',
 'Ex Novo Vision Of Love',
 'Fort George Fresh Hop Field Of Greens',
 'GOW Ticklish warrior',
 'Heater Allen Pils',
 'Little Beast Dream State (2018)',
 'Sierra Nevada  Bigfoot Barleywine',
 'Sierra Nevada Celebration',
 'Stormbreaker Black is Beautiful',
 'pFriem Czech Dark Lager',
 '14 Hands Cabernet Sauvignon',
 'Acrobat Pinot Gris']

Congrats! You've just extracted a whole column of data from a webpage without once using copy/paste!
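
If you're comfortable with list comprehensions, the same find-then-get-text pattern fits on one line. This is just an alternative sketch, not a required step. To keep it self-contained it uses a two-entry stand-in named `mini_tap_list` built from trimmed-down HTML with placeholder `href` values.

```python
from bs4 import BeautifulSoup

html = """
<div class="media-body"><h4 class="media-heading"><a href="/beer/a/1">Breakside IPA</a></h4></div>
<div class="media-body"><h4 class="media-heading"><a href="/beer/b/2">Heater Allen Pils</a></h4></div>
"""
mini_soup = BeautifulSoup(html, "html.parser")
mini_tap_list = mini_soup.find_all("div", class_="media-body")

# Find the <a> tag in each entry and keep only its text
beer_names = [beer.find("a").get_text() for beer in mini_tap_list]
print(beer_names)
```
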

Before we move on to extracting the brewery name, notice that the last two entries in `beer_names` are wines, not beers. You can ignore those two for now - we'll filter them out later. Just keep in mind that the last two entries of `tap_list` are wine, not beer, so they might look a little different than the beer entries.

## 2.5.4 Extracting the brewery name

Let's repeat the process we just completed for beer names to now extract the brewery names.

### Exercise 2.5.4
Execute the code block below, then take another look at the tags within a single beer entry from `tap_list`. Locate the type of tag flanking the brewery name.

In [71]:
# Run this code cell
third_beer_entry = tap_list[2]
print(third_beer_entry.prettify())

<div class="media-body">
 <h4 class="media-heading">
  <a href="/beer/barley-browns-pallet-jack-ipa/5064484871733248">
   Barley Brown's Pallet Jack IPA
  </a>
 </h4>
 <p class="separated">
  <span>
   American IPA
  </span>
  <span>
   7.0% ABV
  </span>
  <span>
   70 IBU
  </span>
 </p>
 <p class="separated">
  <span>
   Barley Brown's
  </span>
  <span>
   Baker City, OR
  </span>
 </p>
 <p class="text-muted small twolines">
  <em>
   Pine | Citrus | Tropical fruit
  </em>
 </p>
 <p class="tags">
 </p>
</div>



### Exercise 2.5.5: Extract the brewery name from a single beer entry

Unlike the unique `<a>` tag that we used to extract the beer name, the brewery name is enclosed in a `<span>` tag. As evidenced in the printed entry above, `<span>` tags appear multiple times within a single beer entry. This means extracting the brewery name will be a little trickier than extracting the beer name.

Before we build a loop, let's first extract the `<span>` tags from a single beer entry. 

In terms of strategy, we'll need to:
- Use `.find_all()` instead of `.find()`, because `.find_all()` returns a list of all matching tags rather than only the first match
- Use list indices to select the `<span>` element containing the brewery name from the list of results
- Assign the **text only** of the single tag containing the brewery name to a variable called `brewery_name`, and display it

In [72]:
# Fill in the '--'s to extract the brewery name
span_list = third_beer_entry.find_all('span')
brewery_span = span_list[3]
brewery_name = brewery_span.get_text()
brewery_name

"Barley Brown's"

### Exercise 2.5.2: Extract the brewery names for all beer entries in `tap_list`

Adapting the code above for a single beer entry, write a `for` loop to extract all the brewery names and append them to a new list named "breweries". Don't forget to disregard the last two entries of `tap_list` that are wine and not beer.

**NOTE: If you don't get the results you expect, how might you adapt your code to handle slight variations in the tag structure of each beer entry in `tap_list`?**

*Hint: There are two directions you can move through a list - from the beginning or from the end.*
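To illustrate the hint, here is a small sketch using plain lists in place of `<span>` tags. The first list copies the span text from the Barley Brown's entry printed above; the second is a hypothetical entry (made-up values) that is missing one leading item. An index counted from the beginning lands on different fields in the two lists, while an index counted from the end stays aligned:

```python
# Span text from the Barley Brown's entry shown earlier
entry_a = ["American IPA", "7.0% ABV", "70 IBU", "Barley Brown's", "Baker City, OR"]

# Hypothetical entry (made-up values) with no IBU span
entry_b = ["Example Style", "5.0% ABV", "Example Brewing", "Anytown, OR"]

for spans in (entry_a, entry_b):
    from_front = spans[3]   # fourth item, counted from the beginning
    from_end = spans[-2]    # second-to-last item, counted from the end
    print(f"from front: {from_front!r:20} from end: {from_end!r}")
```

This only helps when the variation is at the front of the list; if an entry has extra or missing tags at the end, counting from the end will be off too.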

In [73]:
# Fill in the '--'s to loop through tap_list, extract brewery names, and assign to "breweries" list
breweries = []

for beer in tap_list:
  span_list = beer.find_all('span')
  brewery_span = span_list[-2]
  brewery_name = brewery_span.get_text()
  breweries.append(brewery_name)

breweries

['Santa Rosa, CA',
 '2 Towns Ciderhouse',
 "Barley Brown's",
 'Boneyard Elixir LLC',
 'Breakside Brewery',
 'Breakside Brewery',
 'Double Mountain Brewery',
 'Ex Novo Brewing Co.',
 'Fort George Brewery',
 'Grains Of Wrath',
 'Heater Allen Brewing',
 'Little Beast Brewing',
 'Sierra Nevada Brewing Company',
 'Sierra Nevada Brewing Company',
 'StormBreaker Brewing',
 'pFriem Family Brewers',
 'Washington',
 'Oregon']

Is it too early to crack a beer? Because we now have all the data we aimed to collect (beer name and brewery name) in two neat lists!
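One caveat: the first item in the `breweries` output above is 'Santa Rosa, CA', which looks like a location rather than a brewery, so that entry's tags probably differ from the others. Instead of relying on position within all the `<span>` tags, a possibly more robust approach is to anchor on the second `<p class="separated">` block and take its first `<span>`. The sketch below assumes every entry follows the structure printed in Exercise 2.5.3, which has not been checked against the full page. The helper name `brewery_from_entry` and the trimmed HTML string are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the entry structure printed in Exercise 2.5.3
html = """
<div class="media-body">
 <h4 class="media-heading"><a href="#">Barley Brown's Pallet Jack IPA</a></h4>
 <p class="separated"><span>American IPA</span><span>7.0% ABV</span><span>70 IBU</span></p>
 <p class="separated"><span>Barley Brown's</span><span>Baker City, OR</span></p>
</div>
"""
entry = BeautifulSoup(html, "html.parser").find("div", class_="media-body")


def brewery_from_entry(entry):
    """Return the brewery name, or None if the expected tags are missing."""
    separated = entry.find_all("p", class_="separated")
    if len(separated) < 2:
        return None  # entry does not have the second "separated" paragraph
    first_span = separated[1].find("span")
    return first_span.get_text(strip=True) if first_span else None


print(brewery_from_entry(entry))
```

Returning `None` for unexpected entries makes the odd ones easy to spot and inspect with `.prettify()` afterwards.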

# Step 3: Formatting the data 

This last part is a breeze. We'll use the `pandas` library to turn our lists into a dataframe.

First, we import the `pandas` library.

In [74]:
# Run this code cell
import pandas as pd

Because we plan to populate the database for our beer alert feature with tap lists from other bars, it's important to record which bar serves each beer. Let's create another data column to capture the bar name.

### Exercise 3.1 Create a bar name list
Because all of our data came from the same bar, Belmont Station, we need a list that repeats the bar name once for every entry.

Create a list called `bars` that is the same length as our `beer_names` and `breweries` lists, and populate it with the string "Belmont Station".

*Hint: The multiplication operator (`*`) can be used on lists.*

In [75]:
# Fill in the '--' to create the list bars of the same length as the 
# beer_names variable and populate it with the string "Belmont Station"
num_beers = len(beer_names)
bars = ['Belmont Station'] * num_beers
bars

['Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station',
 'Belmont Station']
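Before building a table from several lists, it can be worth confirming that they all have the same length, because `pd.DataFrame` raises a `ValueError` when the lists in a dictionary differ in length. A minimal sketch, using three beer names and breweries from the outputs above as stand-ins for the full lists:

```python
# Short stand-ins for the full lists built earlier in the notebook
beer_names = ["Breakside IPA", "Breakside Up Top!", "Heater Allen Pils"]
breweries = ["Breakside Brewery", "Breakside Brewery", "Heater Allen Brewing"]
bars = ["Belmont Station"] * len(beer_names)

# Collect the lengths and check that they all agree
lengths = {"beer_names": len(beer_names),
           "breweries": len(breweries),
           "bars": len(bars)}
all_same_length = len(set(lengths.values())) == 1

print(lengths, "-> same length:", all_same_length)
```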

### Exercise 3.2 Build the data table

We'll now use `pandas`' DataFrame constructor to build our data table. 

Let's call the DataFrame `active_taps`, and name our columns "beer_name", "brewery", and "bar_name".

*Hint: There are a couple ways to do this, but I like to pass in a dictionary to the DataFrame constructor where the keys are the data column names, and the values are the data lists. For example:*
```python
pd.DataFrame({'col1_name': list1, 
              'col2_name': list2, 
              'col3_name': list3})
```

Complete the code block below. The column names are provided for you. Fill in the list name that goes with each of these column names.

In [76]:
# Fill in the '--'s with our data lists to construct a DataFrame
active_taps = pd.DataFrame({
    "beer_name": beer_names,
    "brewery": breweries,
    "bar_name": bars
})

active_taps

| | beer_name | brewery | bar_name |
|---|---|---|---|
| 0 | Russian River Pliny The Elder | Santa Rosa, CA | Belmont Station |
| 1 | 2 Towns Ciderhouse Easy Squeezy | 2 Towns Ciderhouse | Belmont Station |
| 2 | Barley Brown's Pallet Jack IPA | Barley Brown's | Belmont Station |
| 3 | Boneyard Elixir Lemon Ginger CBD | Boneyard Elixir LLC | Belmont Station |
| 4 | Breakside IPA | Breakside Brewery | Belmont Station |
| 5 | Breakside Up Top! | Breakside Brewery | Belmont Station |
| 6 | Double Mountain Killer Juicy | Double Mountain Brewery | Belmont Station |
| 7 | Ex Novo Vision Of Love | Ex Novo Brewing Co. | Belmont Station |
| 8 | Fort George Fresh Hop Field Of Greens | Fort George Brewery | Belmont Station |
| 9 | GOW Ticklish warrior | Grains Of Wrath | Belmont Station |
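The hint mentions that there are a couple of ways to build the DataFrame. As one alternative sketch, the lists can be zipped into row tuples and the column names passed separately through the `columns` argument. The two-item lists and the name `taps_from_rows` are stand-ins used only for this illustration.

```python
import pandas as pd

# Short stand-ins for the full lists built earlier in the notebook
beer_names = ["Breakside IPA", "Heater Allen Pils"]
breweries = ["Breakside Brewery", "Heater Allen Brewing"]
bars = ["Belmont Station"] * len(beer_names)

# zip() pairs up the i-th item of each list, giving one tuple per row
rows = list(zip(beer_names, breweries, bars))
taps_from_rows = pd.DataFrame(rows, columns=["beer_name", "brewery", "bar_name"])

print(taps_from_rows)
```

Note that `zip()` silently stops at the shortest list, so unlike the dictionary approach it will not complain if the lists have different lengths.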


### Exercise 3.3: Toss the wine rows
Finally, let's ditch our last two rows, which contain wine data, not beer data.

Drop the last two rows and save the result back to the `active_taps` DataFrame.

In [77]:
# Fill in '--' to drop the last two rows of the DataFrame
active_taps = active_taps[:-2]
active_taps

| | beer_name | brewery | bar_name |
|---|---|---|---|
| 0 | Russian River Pliny The Elder | Santa Rosa, CA | Belmont Station |
| 1 | 2 Towns Ciderhouse Easy Squeezy | 2 Towns Ciderhouse | Belmont Station |
| 2 | Barley Brown's Pallet Jack IPA | Barley Brown's | Belmont Station |
| 3 | Boneyard Elixir Lemon Ginger CBD | Boneyard Elixir LLC | Belmont Station |
| 4 | Breakside IPA | Breakside Brewery | Belmont Station |
| 5 | Breakside Up Top! | Breakside Brewery | Belmont Station |
| 6 | Double Mountain Killer Juicy | Double Mountain Brewery | Belmont Station |
| 7 | Ex Novo Vision Of Love | Ex Novo Brewing Co. | Belmont Station |
| 8 | Fort George Fresh Hop Field Of Greens | Fort George Brewery | Belmont Station |
| 9 | GOW Ticklish warrior | Grains Of Wrath | Belmont Station |
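Slicing off the last two rows works here because we know the wines come last on this particular tap list. As a hedged sketch of two alternatives, the block below shows the same position-based drop written with `.iloc`, and a name-based filter that does not depend on row order. The small `taps` DataFrame and the hand-written `wines` list are stand-ins for illustration.

```python
import pandas as pd

# Small stand-in table using the last four names from beer_names
taps = pd.DataFrame({
    "beer_name": ["Stormbreaker Black is Beautiful", "pFriem Czech Dark Lager",
                  "14 Hands Cabernet Sauvignon", "Acrobat Pinot Gris"],
    "bar_name": ["Belmont Station"] * 4,
})

# Position-based: keep everything except the last two rows
beers_only = taps.iloc[:-2].copy()

# Name-based: drop rows whose beer_name appears in a hand-written list of wines
wines = ["14 Hands Cabernet Sauvignon", "Acrobat Pinot Gris"]
beers_only_by_name = taps[~taps["beer_name"].isin(wines)].reset_index(drop=True)

print(beers_only)
print(beers_only_by_name)
```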


There you have it, from soup to tidy data!

From here, you could export the data to a .xlsx or .csv file, or load it into a SQL database, which our hypothetical beer app could then query to power its "favorite beer alert" feature.
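As a rough sketch of what that export could look like, the block below writes a small stand-in DataFrame to a CSV file and to a SQLite database, then queries the database the way an alert feature might. The file names, the table name `active_taps`, and the temporary output directory are all assumptions made for this example.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the active_taps DataFrame built above
active_taps = pd.DataFrame({
    "beer_name": ["Breakside IPA", "Heater Allen Pils"],
    "brewery": ["Breakside Brewery", "Heater Allen Brewing"],
    "bar_name": ["Belmont Station", "Belmont Station"],
})

out_dir = Path(tempfile.mkdtemp())  # throwaway folder for this example

# CSV export; index=False leaves out the 0, 1, 2, ... row labels
csv_path = out_dir / "active_taps.csv"
active_taps.to_csv(csv_path, index=False)
round_trip = pd.read_csv(csv_path)

# SQLite export, followed by a query for one brewery's beers
conn = sqlite3.connect(out_dir / "taps.db")
active_taps.to_sql("active_taps", conn, if_exists="replace", index=False)
favorites = pd.read_sql_query(
    "SELECT beer_name, bar_name FROM active_taps WHERE brewery = ?",
    conn,
    params=("Breakside Brewery",),
)
conn.close()

print(favorites)
```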

# Additional exercises
If you had fun web scraping, here are a few more exercises you can try on your own using this TapHunter HTML:
- extract the beer type
- extract the ABV
- extract the brewery location
- extract the bar's hours and phone number
- turn each extraction step into a function
- write a TapHunter web scraping script and try it out on another bar's TapHunter page
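As a possible starting point for the "turn each extraction step into a function" exercise, here is a sketch that pulls the beer type and ABV out of one entry. It assumes the first `<p class="separated">` holds the style, ABV, and IBU spans, as in the entry printed in Exercise 2.5.3; other entries may differ. The function name `parse_style_and_abv` and the trimmed HTML are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the entry structure printed in Exercise 2.5.3
html = """
<div class="media-body">
 <h4 class="media-heading"><a href="#">Barley Brown's Pallet Jack IPA</a></h4>
 <p class="separated"><span>American IPA</span><span>7.0% ABV</span><span>70 IBU</span></p>
 <p class="separated"><span>Barley Brown's</span><span>Baker City, OR</span></p>
</div>
"""
entry = BeautifulSoup(html, "html.parser").find("div", class_="media-body")


def parse_style_and_abv(entry):
    """Return (style, abv); abv is a float, or None if no '% ABV' span exists."""
    first_p = entry.find("p", class_="separated")
    if first_p is None:
        return None, None
    texts = [span.get_text(strip=True) for span in first_p.find_all("span")]
    style = texts[0] if texts else None
    abv = None
    for text in texts:
        if text.endswith("% ABV"):
            abv = float(text.replace("% ABV", ""))
    return style, abv


print(parse_style_and_abv(entry))
```

Searching for the "% ABV" suffix rather than using a fixed index keeps the function working when an entry has more or fewer spans than expected.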

I highly encourage you to try out your web scraping skills on other websites, as every website's HTML will be as unique as the developers that built it.

## Big thanks!
I hope you had fun learning how to data-ify the internet! Thanks so much for your time and attention today. 

I also want to thank the BioData Club organizers (Robin Champeaux, Ted Laderas, Marijane White, and Eric Earl) for this opportunity as well as their support & feedback.

# Resources

- [requests](https://requests.readthedocs.io/en/master/) library documentation
- [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library documentation
- [pandas](https://pandas.pydata.org/docs/user_guide/index.html) library documentation
- If you like how-to coding books, check out *Web Scraping with Python* by Ryan Mitchell ([link](https://books.google.com/books/about/Web_Scraping_with_Python.html?id=v_k6jwEACAAJ))