# WYSIWYG Scraping

There's no arguing that nerds love acronyms. And one of the best acronyms, imho, is [WYSIWYG](https://en.wikipedia.org/wiki/WYSIWYG) (pronounced wiz-ee-wig): *What You See Is What You Get*.

Originally, WYSIWYG referred to software that "[allows users to see and edit content in a form that appears as it would when displayed on an interface, webpage, slide presentation or printed document](https://www.techtarget.com/whatis/definition/WYSIWYG-what-you-see-is-what-you-get)."

That was a sea change from having to, for example, edit HTML directly. Instead, you could hit a button on an editing panel to make some text **bold**, and the text would display in an editing window as **bold** (as opposed to displaying raw HTML such as `<strong>bold</strong>`).

Why bring this up?

Because it's a useful analogy for web pages. Back in the days of yore, many (perhaps most?) websites followed the WYSIWYG principle. These were simpler times, when the content displayed on a web page closely matched the HTML in the underlying document for a page.

If your web browser showed a table of data, it was quite likely that you'd find a `<table>` element somewhere in the page's HTML. 

This made web scrapers happy.

We could write simple code to grab the page, pluck out the data, and be on our way.

## A WYSIWYG example.com

Those halcyon [days are waning](drive_the_browser_robot.ipynb), but you can still find websites in the wild that are coded this way.

For example, compare the browser display of <http://example.com> and its underlying HTML.

![example dot com display vs HTML](../files/example_dot_com_wysiwyg.png)

## Failed Banks Simple Scrape

Similar to <http://example.com>, the [FDIC's list of failed banks](https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/) also embeds its content -- at least the data we care about -- in the HTML source code. The data on the site is "paged" -- meaning that it only displays the first 10 banks by default. To view or scrape all 500-plus banks, you would need to step through each of the pages to gather all the content. For this tutorial, we'll focus on the simpler task of scraping just the 10 banks on the home page, and leave page-handling as an exercise at the end of this tutorial.

> IMPORTANT: The site also provides the data as a downloadable CSV, which is a better alternative to scraping. But we'll ignore that for now and use the page as an opportunity to practice basic scraping techniques.

How do we know that the HTML source code contains the bank data?

By using the browser's developer tools to [dissect the website](dissecting_websites.ipynb).

## View Page Source

Step 1 involves right-clicking on the page and selecting `View Page Source`.

> *Note: The precise wording of the menu options may vary by browser. We recommend Firefox or Chrome, by the way.*

![FDIC view page source](../files/fdic_banks_html.png)

## Compare What You See to Page Source

The next step involves some basic code sleuthing: 

- Review the list of banks as displayed in the browser. For example, we can see above ☝️ that `Pulaski Savings Bank` in Illinois is at the top of the list.
- Now head over to the HTML source page you just opened (using `View Page Source`) and search the page for `Pulaski Savings Bank`.

You should see that the data is embedded in the HTML.

![FDIC html source code](../files/fdic_view_page_source.png)

## Get ready to scrape

This simple page allows us to use two of the traditional workhorses of Python web scraping: the [requests](https://requests.readthedocs.io/en/latest/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) libraries.

The former gives us the ability to fetch the source HTML for the page. The latter provides a Pythonic "interface" to the HTML that lets us easily extract data points.

## Fetch the page

> **IMPORTANT**: This section of the tutorial, which uses `requests`, only works on GitHub Codespaces or locally (if you've cloned this repo to your own machine).

Let's start our scrape by requesting the page content.

In [None]:
import requests
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
response = requests.get(url)

Now inspect the `response` to see what we're working with.

In [None]:
# Get the type
type(response)

In [None]:
# Ouch, that's a long list!
dir(response)

If you review the `response` object's attributes (see the [Hidden Life of Objects](../classes_and_oop/hidden_life_of_objects.ipynb) for background), you'll notice there are some potentially handy bits of data and functionality. 

> It's not a bad idea to review the [requests documentation on responses](https://requests.readthedocs.io/en/latest/user/quickstart/#response-content) as well. 

For our purposes, we're going to grab the `text` of the response, which should contain the raw HTML of the page.

Prepare yourself for a LOT of HTML.

> BTW, you can clear the output by selecting `Edit -> Clear Output` in the Jupyter Lab file menu.

In [None]:
response.text

And the data type of `response.text`?

In [None]:
type(response.text)

Not surprisingly, it's plain old text (aka a string).

## A Hearty Soup

We're now ready to begin plucking the bank data out of this big blob of text using `BeautifulSoup`, which is an HTML parsing library that makes it simple to extract information from web pages. 

> Why soup? Perhaps because HTML is a beautiful blend of savory tags and other ["markup"](https://en.wikipedia.org/wiki/Markup_language). What kind of soup, you ask? Clearly it's a [goulash](https://en.wikipedia.org/wiki/Goulash).

In [None]:
import bs4

# Create "soup" using bs4's standard, built-in HTML parser
soup = bs4.BeautifulSoup(response.text, 'html.parser')

Let's find out what we're working with.

In [None]:
type(soup)

If we hit the BeautifulSoup docs, we'd learn that the [BeautifulSoup object](https://beautiful-soup-4.readthedocs.io/en/latest/#beautifulsoup) provides all sorts of nifty ways to navigate and extract content from HTML.

For example, we can select page elements by their HTML tags.

In [None]:
# find_all returns a list-like collection of every matching element
tables = soup.find_all('table')
len(tables)

Looks like there's only one table on the page. And recall that our data is stored in a proper HTML table, so this seems like a promising path.

## Grab the headers

Our HTML table has a header row containing field names.

Let's grab the header cells and see what we're working with.

In [None]:
headers = soup.find_all('th')
# Print the first header cell
print(headers[0])

So we have a `th` element -- which stands for *table header* -- and that in turn contains a link or *anchor* (`a`) tag.

If we zoom out to the table level, the HTML tags are nested inside each other in a structure like the one below:
    
```
table
  thead
    tr
      th
        a
```

The neat thing about BeautifulSoup is that it allows you to navigate, or "walk", the [tree structure of HTML](https://en.wikipedia.org/wiki/Document_Object_Model) by using [dot notation](../classes_and_oop/README.ipynb).

In [None]:
# Grab the header text for the first column, 
# starting with the "th" tags stored in the headers variable
headers[0].a.text
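
If you'd like to experiment with dot notation without hitting the FDIC site, here is a minimal sketch on a toy table. The HTML snippet, the `href` values and the `demo_soup` name are made up for illustration; only the nesting (`table > thead > tr > th > a`) mirrors the structure described above.

```python
import bs4

# Toy HTML (hypothetical) that mimics the nesting of the real header row
demo_html = """
<table>
  <thead>
    <tr>
      <th><a href="?sort=name">Bank Name</a></th>
      <th><a href="?sort=state">State</a></th>
    </tr>
  </thead>
</table>
"""
demo_soup = bs4.BeautifulSoup(demo_html, 'html.parser')

# Each dot step returns the FIRST matching tag nested inside the previous one
first_header = demo_soup.table.thead.tr.th
first_header_text = first_header.a.text
print(first_header_text)  # Bank Name

# find_all, by contrast, returns every match
all_header_text = [th.a.text for th in demo_soup.find_all('th')]
print(all_header_text)  # ['Bank Name', 'State']
```

One thing to watch for: if a tag in the chain doesn't exist, dot notation returns `None`, and the next step (such as `.text`) raises an `AttributeError`.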

Using this strategy, we can grab the text for all header cells.

Along the way, we'll do some basic cleanups on the field names. In particular, the last column, called `Fund`, is a bit gnarly in its raw form and looks like this:

```text
 'Fund\n\n    Sort ascending\n      \n\n'
```
So we'll use the `split` method to break the text apart on newline characters and keep only the first piece, the word `Fund`. The code as written should not affect other column headers.

In [None]:
column_names = []
for th in headers:
    col_name = th.a.text
    clean_name = col_name.split('\n')[0] # split on newline characters, grab base column name from beginning of list
    column_names.append(clean_name)
column_names
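
To see why splitting on newlines does the job here, the sketch below compares it with a plain `strip` on the raw `Fund` text shown earlier. The variable names are just for illustration.

```python
# The raw header text for the last column, as shown above
raw_header = 'Fund\n\n    Sort ascending\n      \n\n'

# strip() only trims whitespace at the two ends, so the
# "Sort ascending" label in the middle survives
stripped = raw_header.strip()
print(repr(stripped))  # 'Fund\n\n    Sort ascending'

# Splitting on newlines and keeping the first piece isolates the name
first_line = raw_header.split('\n')[0]
print(repr(first_line))  # 'Fund'

# A header with no newlines passes through unchanged
plain = 'Bank Name'.split('\n')[0]
print(repr(plain))  # 'Bank Name'
```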

## Nab the data

We can repeat the above process to also extract the bank data, which is nested inside of the `tbody` tag.

The data is stored in an HTML structure that looks like below, where `tr` represents a single *table row* and `td` represents *table data*. There's a `td` element for each field in the row.

```
table
  tbody
    tr
      td
      td
      etc
```

In [None]:
# note we use the singular "find" for tbody, 
# which returns the first element matching the tag name
tbody = soup.find('tbody') 

# Grab all the rows inside tbody
rows = tbody.find_all('tr')

# Print number of rows (fact-check this against the count on the FDIC site)
print(len(rows))
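
Why search inside `tbody` rather than the whole page? A quick sketch on a toy table (the HTML and bank names are invented placeholders) suggests the reason: a page-wide search for `tr` also picks up the header row.

```python
import bs4

# Toy table (hypothetical data) with one header row and two data rows
toy_html = """
<table>
  <thead>
    <tr><th>Bank Name</th><th>State</th></tr>
  </thead>
  <tbody>
    <tr><td>Example Bank A</td><td>IL</td></tr>
    <tr><td>Example Bank B</td><td>KS</td></tr>
  </tbody>
</table>
"""
toy_soup = bs4.BeautifulSoup(toy_html, 'html.parser')

# Searching the whole document includes the header row
all_rows = toy_soup.find_all('tr')

# Searching from tbody returns only the data rows
body_rows = toy_soup.find('tbody').find_all('tr')

print(len(all_rows), len(body_rows))  # 3 2

# Pull the cell text out of the first data row
first_row_values = [td.text.strip() for td in body_rows[0].find_all('td')]
print(first_row_values)  # ['Example Bank A', 'IL']
```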

Let's inspect one of the rows to get a handle on the HTML structure.

In [None]:
rows[0]

Once again, we see that `td` tags are used to store the values for individual columns. 

Let's collect the bank data. We'll perform some basic data clean-ups along the way and store the values from each row in a [dictionary](../python_dict_basics.ipynb).

In [None]:
all_banks = []
for row in rows:
    fields = row.find_all('td')
    field_values = [
        fields[0].text.strip(),
        fields[1].text.strip(),
        fields[2].text.strip(),
        fields[3].text.strip(),
        fields[4].text.strip(),
        fields[5].text.strip(),
        fields[6].text.strip()
    ]
    # Mash up the headers with the field values into a dictionary
    # - zip pairs each column header with the corresponding field in a two-element tuple
    # - dict transforms the list of column/value pairs into a dictionary
    bank_data = dict(zip(column_names, field_values))
    all_banks.append(bank_data)
    
# "Pretty Print" a row for inspection
from pprint import pprint
pprint(all_banks[0])
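
If the `zip` and `dict` mash-up feels a bit magical, it may help to run it on a tiny example. The column names and values below are placeholders rather than real FDIC data.

```python
# Placeholder headers and values for a single row
demo_columns = ['Bank Name', 'City', 'State']
demo_values = ['Example Bank', 'Example City', 'IL']

# zip pairs items by position; wrapping it in list() lets us see the pairs
pairs = list(zip(demo_columns, demo_values))
print(pairs)
# [('Bank Name', 'Example Bank'), ('City', 'Example City'), ('State', 'IL')]

# dict turns each (key, value) pair into a dictionary entry
demo_bank = dict(pairs)
print(demo_bank['State'])  # IL

# Caution: zip silently stops at the shorter input,
# so a row with missing cells yields a dictionary with missing keys
short_bank = dict(zip(demo_columns, ['Example Bank']))
print(short_bank)  # {'Bank Name': 'Example Bank'}
```

That last behavior is worth keeping in mind: if a row ever has fewer cells than there are headers, you won't get an error from `zip`, just a smaller dictionary.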

How many rows of data do we have?

In [None]:
len(all_banks)

Does that number match the count on the FDIC site?

## Summary

Congratulations! You've scraped the first page of bank data from the FDIC Failed Banks List!

Let's see you flex those coding muscles on a few additional exercises.


## Exercises

### Bank Detail Page Links

If you'd like some more practice with BeautifulSoup, try modifying the code in this notebook to also extract the URL for the bank "detail" page. 

You should get started by circling back to the HTML source code and pinpointing the URL location for banks. *Hint: The URL is stored on an [attribute](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) of an `a` tag.*

Hit the BeautifulSoup docs to learn [how you can access tag attributes](https://beautiful-soup-4.readthedocs.io/en/latest/#attributes).
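
As a nudge in the right direction, here is a small sketch of attribute access on a toy table cell. The HTML and the `/example/bank-detail.html` path are invented for illustration; you'll need to check the real page source to see how the FDIC links are actually structured.

```python
import bs4

# Toy cell (hypothetical) containing a link to a "detail" page
cell_html = '<td><a href="/example/bank-detail.html">Example Bank</a></td>'
cell = bs4.BeautifulSoup(cell_html, 'html.parser').td

link = cell.a

# Tag attributes can be read like dictionary keys...
detail_href = link['href']
print(detail_href)  # /example/bank-detail.html

# ...or with .get(), which returns None instead of raising
# a KeyError when the attribute is missing
missing_title = link.get('title')
print(missing_title)  # None
```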

### Get all the data

We mentioned earlier that the data on the FDIC site is "paged":

![FDIC paging of results](../files/fdic_paging.png)

If you were to click the link for the next page, you would notice that it takes you to a new page with a slightly different URL.

![FDIC page 1 results](../files/fdic_page1_url.png)

You might further notice the extra parameter on the URL: `?page=1`. 

Although this is actually the second page of results, it is perhaps confusingly called page 1. Computers, unlike humans, are fond of using `0` to identify the first item in a list, and the coders behind this site clearly let the machines decide on its pagination scheme.

In any event, the important thing to understand is that these are [predictable URLs](website_personalities.ipynb#Predictable-URLs-and-Query-Strings) -- i.e., URLs that you can construct predictably if you know the total number of pages. And the good news is that we do!
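
To make "predictable" concrete, here is a rough sketch of how such URLs could be generated. The `build_page_urls` helper and the page count of `3` are placeholders for illustration, and the sketch assumes that `?page=0` returns the first page of results, which you should verify in your browser.

```python
base_url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'

def build_page_urls(base_url, total_pages):
    """Return one URL per results page, numbering pages from 0."""
    return [f'{base_url}?page={page_number}' for page_number in range(total_pages)]

# 3 is a placeholder; the real count comes from the paging section of the site
page_urls = build_page_urls(base_url, 3)
for page_url in page_urls:
    print(page_url)
```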

Your mission for this exercise is to update this notebook to handle the following tasks:

- Extract the total page count from the "paging" section at the bottom
- Create a [*for* loop](../python_syntax_crash_course.ipynb#Loops) that steps through the full range of pages based on their number. *Hint: Check out Python's [range](https://www.w3schools.com/python/ref_func_range.asp) function for a simple approach to this problem.*
- Extract the data from each page and store it in a list
- **Bonus points**: Use the [csv](../python_csv.ipynb) library to export the data to an external file.
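
For the bonus task, one possible approach (a sketch, not the only way) is the standard library's `csv.DictWriter`, which writes a list of dictionaries like `all_banks`. The `write_banks_csv` helper and the sample rows are hypothetical; the demo writes to an in-memory buffer so you can inspect the result without creating a file.

```python
import csv
import io

def write_banks_csv(banks, outfile):
    """Write a list of bank dictionaries to an open, file-like object."""
    # Use the keys of the first row as the CSV header
    fieldnames = list(banks[0].keys())
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(banks)

# Placeholder rows standing in for the scraped data
sample_banks = [
    {'Bank Name': 'Example Bank A', 'State': 'IL'},
    {'Bank Name': 'Example Bank B', 'State': 'KS'},
]

buffer = io.StringIO()
write_banks_csv(sample_banks, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a real file handle, for example `open('failed_banks.csv', 'w', newline='')` inside a `with` block (the filename is up to you).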

## What's next?

The next step in our journey involves tackling more complex sites using a [robot to drive a web browser](drive_the_browser_robot.ipynb). Onward and upward!