# Web Scraping with Python

Now that we're experts in (the basics of) Python, it's time to head out to the real world and test our Python mettle.

One problem that comes up over and over in research applications is that we find a website that has the exact data we need for our work, but it doesn't provide an API for us to access the data cleanly. Instead, we may see the data either in a table on the website, or even formatted so we have to click on links and jump around the site to see all of the data we need. As a result, we end up copying data by hand, a tedious, time-consuming, and error-prone endeavor.

## 0. The Problem

Take for example this list on Wikipedia of named lakes in California: https://en.wikipedia.org/wiki/List_of_lakes_in_California

Let's pretend we need such a list for a project we're working on. Specifically, we want a list of lakes and information about them, including their geographic coordinates, surface area, maximum depth, and water volume. It would be a real pain to copy every entry in that table. Moreover, we cannot see details such as coordinates, water volume, and maximum depth without clicking on each of the links.

Rather than copying all these data by hand, we will use Python to automate the data collection process, to produce a tidy CSV we can use in our research.

## 1. A word of caution

Web scraping falls in a murky legal area, but courts appear to be [looking favorably](https://www.eff.org/deeplinks/2018/04/dc-court-accessing-public-information-not-computer-crime) on the practice.

First and foremost, you should always make sure the data are available for you to use in your work. Sometimes the incantations of a website's terms & conditions will prohibit this entirely. For example, an exercise we could have done is to collect snowfall data for California ski resorts from [OnTheSnow](https://www.onthesnow.com). However, [that website's terms and conditions](https://www.onthesnow.com/terms) explicitly prohibit automated---and even manual!---data collection. 

Even when a website does not explicitly prohibit scraping---and even if courts decide terms prohibiting scraping are meaningless---you should always be considerate when collecting data. When you open a website, there is a piece of software running on a physical computer somewhere that sends you the content you've requested. These resources are **not free and unlimited**: someone is paying to run this infrastructure, and it can only handle a limited number of requests at a time. When you automate web requests, it's easy to make thousands of requests very quickly. At worst, [this can overwhelm the website and bring it down](https://en.wikipedia.org/wiki/Denial-of-service_attack), making it unavailable to anyone who wants to access it. In many cases, the server will detect that you are a robot and block you. (Note also that when you are on a shared network such as Stanford's, the server's block may affect your peers as well!)

### 1.1 How do I respect the server?

Generally this just means rate-limiting your requests. In Python we will use the `time.sleep` function to pause your script for a specified number of seconds (say, 1 second) between requests.
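Here is a minimal sketch of that rate-limiting pattern, assuming you already have a list of URLs to visit. The `fetch_politely` helper and its `fetch` parameter are hypothetical names used only for illustration; in a real scraper, `fetch` would wrap an actual web request, and here a stand-in keeps the sketch off the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first one
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Stand-in for a real request so the sketch runs without touching the network
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.1,
)
print(len(pages))
```

One second is only a starting point; if the site publishes its own guidance on request rates, follow that instead.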

### 1.2 What else should I fear?

<img src="img/recaptcha.png" width=300 />

This is a ReCAPTCHA. Google provides this service to websites that wish to discourage automated access. There are a lot of details about ReCAPTCHA, but the bottom line is: if you start seeing them pop up, you've been busted as a web scraper and need to throttle your script more. You may be out of luck for scraping your target website until Google lets you out of its digital doghouse.

## 2. Getting ready

First we'll install some new modules:

 - [`requests`](https://2.python-requests.org/en/master/) - For all of the human-centeredness of Python, the standard library `urllib` is a real pain to make web requests with. Instead we'll use this third-party module that makes HTTP requests dead simple.
 - [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/) - This is a powerful module that parses HTML, the markup language used to display websites. It deals gracefully with all sorts of messy situations you will encounter in the wild, where websites are written improperly. Generally you can trust it to do the right thing and you don't need to know any of the details of what that means.

In [None]:
import sys

!{sys.executable} -m pip install requests beautifulsoup4

## 3. Diving in

Now we'll take a first look at the lakes website we want to scrape: https://en.wikipedia.org/wiki/List_of_lakes_in_California

What we see is a table with some information and some hyperlinks to pages with more detail:

<img src="img/list_of_lakes.png" width=800 />


### 3.1 Inspecting HTML 

Before we start writing anything in Python, we need first to understand how this website is put together.

To do this, we will **inspect** the HTML of the website.

In Chrome, try right-clicking on the first column header that reads "Name" and select "Inspect" on the menu that appears.

<img src="img/inspect.png" width=500 />

Now you can see how that table is constructed. In the window that pops up, you should see something like the following markup:

<img src="img/lakes_table_markup.png" width=500 />

#### 3.1.1 But wait, what is this "HTML" ... ?

If you aren't familiar with HTML, don't worry too much. But you will need to learn eight fun facts about it very quickly:

1. Every website you've ever seen is formatted with **HTML**.
2. HTML is a **markup language**. That means it is not a _programming_ language for giving a computer arbitrary instructions; rather, it simply dresses up plain text to describe how it should be presented.
3. The symbols used to dress up text are called **tags**.
4. The name of the tag is always enclosed in angle brackets. Generally there is an **opening tag** written like `<foo>` followed later by a **closing tag** written as `</foo>`.
5. The combination of an opening tag, its content (if there is any), and a closing tag is called an **element**.
6. For example, to make text bold, I might use the `<strong>` tag: `I AM SHOUTING <strong>LOUDLY</strong>`. (The `LOUDLY` will be rendered in boldface.)
7. HTML tags can be **nested**. As a result, you can think of the entire document as a **hierarchy** of tags. The content of the website will always be found enclosed within a `<body></body>` tag. For example, `<body><p>Hello, <em>friend</em>!</p></body>` will display a friendly greeting inside a paragraph (`<p>`), with the word "friend" italicized (`<em>`).
8. HTML tags can have **attributes**. Some attributes are semantic and change what a tag means on the page. Other attributes are cosmetic and change only how the tag appears. Still other attributes serve as metadata that is used for a variety of purposes, perhaps to identify that tag among similar tags on a page, or to add annotations for accessibility devices.
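To make the ideas of nesting and attributes concrete, here is a small sketch that prints the hierarchy of the greeting from fact 7. It uses Python's standard-library `html.parser` so it runs without any extra installs. The `OutlineParser` class is a hypothetical helper written for this illustration, and the `id="greeting"` attribute was added to the example just to show what an attribute looks like.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record each opening tag with its nesting depth and attributes.

    Assumes every opening tag has a matching closing tag.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = OutlineParser()
parser.feed('<body><p id="greeting">Hello, <em>friend</em>!</p></body>')

# Indent each tag by its depth to show the hierarchy
for depth, tag, attrs in parser.outline:
    print("  " * depth + f"<{tag}> {attrs}")
```

The indentation in the printed outline mirrors the hierarchy: `<em>` sits inside `<p>`, which sits inside `<body>`.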

#### 3.1.2 What can HTML do for me, an aspiring web scraper?

A webpage may have a large volume of text decorated with hundreds or thousands of HTML tags. Within that gigantic haystack is the tiny needle of information you want to pluck out of it. Your job in writing a scraper is to write a script that can not only locate that needle on one website, but is robust enough to locate the needles in a set of haystacks.

Fortunately, we are typically dealing with reasonably well-structured haystacks with needles that have been placed systematically into them, in a predictable location. So if we find the needle in one haystack, we just have to write down the description of the location so we can find the needle in all haystacks.

We will describe the needle's location in terms of the HTML tags and their attributes. Once we inspect the page to locate the element that contains the information we want to scrape, there are three main observations we want to make about our target:

1. What type of tag is this? Some elements have well-defined structures that simplify parsing. For example, `<table>` elements have rows defined by `<tr>` elements and cells within those rows defined by `<td>` elements (or `<th>` elements for header cells).
2. Does this tag have any identifying attributes? The gold standard is to find an element with an `id` attribute, because these are (nearly) always unique on the page. Often you will find a `class` attribute that is also very useful.
3. Where does this element fall in the page hierarchy? Do any of this element's parents have identifying marks? Perhaps the element that immediately contains your needle is a plain old `<span>` tag ... but maybe directly above it there is a `<div>` with an ID that you can use to locate the element.

By combining these techniques, we can create a heuristic rule that finds the needle in all of the pages we are targeting. Some examples of the heuristics we're writing are:

 - "Find the `<div>` with the `id="my_data"` and then take the text of the third `<span>` tag inside of it."
 - "Find the second table on the page and extract the `href` attribute from every `<a>` tag inside the second cell in all but the first row."
 - "Extract the text of every `<span>` with the class `"my-value"`."
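
As a rough sketch, the three example heuristics above might translate into `BeautifulSoup` queries like the following. The HTML here is made up purely to exercise the rules, and it assumes the `beautifulsoup4` package is installed; real pages will be messier.

```python
from bs4 import BeautifulSoup

# Made-up HTML, invented purely to exercise the three rules
html = """
<div id="my_data"><span>a</span><span>b</span><span>c</span></div>
<table><tr><td>first table</td></tr></table>
<table>
  <tr><th>Name</th><th>Link</th></tr>
  <tr><td>One</td><td><a href="/one">one</a></td></tr>
  <tr><td>Two</td><td><a href="/two">two</a></td></tr>
</table>
<p><span class="my-value">10</span> and <span class="my-value">20</span></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Rule 1: the text of the third <span> inside the <div> with id="my_data"
third_span = soup.find("div", id="my_data").find_all("span")[2].text

# Rule 2: the href of every <a> in the second cell of all but the first row
# of the second table
second_table = soup.find_all("table")[1]
hrefs = []
for row in second_table.find_all("tr")[1:]:
    cells = row.find_all("td")
    hrefs.extend(a.get("href") for a in cells[1].find_all("a"))

# Rule 3: the text of every <span> with the class "my-value"
values = [span.text for span in soup.find_all("span", class_="my-value")]

print(third_span, hrefs, values)
```

Note that every rule leans on position (`[2]`, `[1]`, `[1:]`) or on an identifying attribute, which is exactly what you look for when inspecting the page.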

#### 3.1.3 That sounds tedious!

It can be! But often it's easier than it sounds. Let's try it.

### 3.2 Fetching a website in Python

First, we will use `requests` to load the HTML of the lakes Wikipedia page.

In [None]:
import requests

# Fetch the Lakes website
response = requests.get("https://en.wikipedia.org/wiki/List_of_lakes_in_California")

You can explore the `response` object as follows:

In [None]:
# Check the HTTP status code. This should be 200 -- anything between 200 and 299 is a good thing.
# Codes from 300 to 399 are redirects, which `requests` follows automatically by default.
# Status codes of 400 or higher mean Trouble of varying degrees!
response.status_code

In [None]:
# Check the text returned by the web server. This is the HTML you inspected in Chrome!
response.text

There's a lot more information contained in HTTP responses, but these are the main things you need.
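
When you fetch many pages in a loop, you may not want to check each status code by eye. One possible approach, sketched below, is a small helper that fails loudly on an error status. The `fetch_html` name and its `get` parameter are hypothetical (the parameter exists only so the helper can be tried out with a stand-in); `timeout` and `raise_for_status()` are part of `requests` itself.

```python
import requests


def fetch_html(url, get=requests.get, timeout_seconds=10):
    """Return the HTML at url, raising an exception on a 4xx or 5xx status."""
    # Without a timeout, a request to an unresponsive server can hang forever
    response = get(url, timeout=timeout_seconds)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://en.wikipedia.org/wiki/List_of_lakes_in_California")` should then give the same text as `response.text` above, or raise an exception if the server reports an error.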

### 3.2 Making sense of the `response`

The `response` object's `text` attribute is just that---text. In order to work with it, we need to parse the structure of the HTML. This is where `BeautifulSoup` comes in.

In [None]:
from bs4 import BeautifulSoup

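# Name the parser explicitly; otherwise BeautifulSoup guesses one and emits a warning
soup = BeautifulSoup(response.text, "html.parser")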

Now `soup` contains the full parsed HTML from the lakes website. In other words, instead of plain text, the page is now represented in a data structure we can easily navigate and query.

To prove this, let's check out the `<title>` tag of the website:

In [None]:
soup.title

We can also extract just the inner text of that element:

In [None]:
soup.title.text

We can do more sophisticated things. For example, to find all the links on the page, we can query for the `<a>` tags:

In [None]:
soup.find_all("a")

And to pull the URLs from all of these links, we can extract the `href` attribute from every element:

In [None]:
[a.get("href") for a in soup.find_all("a")]

We can further refine our queries by looking for specific attributes. For example, to find all of the section headers, we note that they look like `<span class="mw-headline" id="List">List</span>` and write a query for the `class` attribute, such as:

In [None]:
soup.find_all("span", attrs={"class": "mw-headline"})

#### 3.2.1 Exercise

Find all of the table rows that list lakes and store them in a variable named `lake_rows`.

Things to consider:

1. How do you find the right `<table>` element?
2. What HTML tag defines table rows?
3. Are there rows in the table that don't contain information on lakes?

In [None]:
lake_rows = ...

Now for every column in the table, extract the data contained in each table cell element and store these data in a variable called `lake_data`.

Things to consider:
1. What element contains table cells?
2. Are there elements nested within table cells?
3. What's the correct way to store links?
4. What is the data structure you are storing in `lake_data`?

Bonus: Are there other ways you would manipulate the text you're extracting?

In [None]:
lake_data = ...
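
If you get stuck, here is one possible shape for a solution, sketched on a made-up two-column table rather than the real page. The `wikitable` class name, the column layout, and the choice of a list of dictionaries are all assumptions for illustration; check them against what you see when you inspect the actual page.

```python
from bs4 import BeautifulSoup

# A made-up two-column table; the real page has more columns and messier cells
html = """
<table class="wikitable">
  <tr><th>Name</th><th>County</th></tr>
  <tr><td><a href="/wiki/Example_Lake">Example Lake</a></td><td>Alpine</td></tr>
  <tr><td>Unlinked Pond</td><td>Mono</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        # The header row uses <th> cells, so it has no <td> and is skipped
        continue
    link = cells[0].find("a")
    rows.append({
        "name": cells[0].get_text(strip=True),
        "county": cells[1].get_text(strip=True),
        # Keep the link target, or None when the cell has no link
        "href": link.get("href") if link else None,
    })

print(rows)
```

Storing one dictionary per row keeps each value labeled, which makes it easier to add more fields later.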

### 3.3 Crawling 🏊

Now we have all the data on the main Wikipedia page about California lakes. Great! In some real-world scenarios, this may be all you need to do.

But in our case we will go a step further. To get the details of water volume, geo coordinates, etc., we need to navigate to each page for each lake and extract this info from the info panel:

<img src="img/lake_info.png" width=250 />

#### 3.3.1 Exercise

Write a loop over all the lake links and scrape the contents of those sites.

**IMPORTANT** This is where you should be sure to use `time.sleep` between requests to ensure you don't overload the server (or, more realistically in this case, get blocked).

Things to consider:

1. Not all of the lakes have Wikipedia entries. How do you know? How do you handle this case?
2. Not all of the lake pages have all (or any) of the information we want. How do you handle these cases?
3. How do you update the `lake_data` structure to store new data?

In [None]:
import time

for d in lake_data:
    # 1) Fetch the HTML for the lake's Wikipedia page
    # 2) Extract the relevant data from the HTML
    # 3) Store the extracted data with the rest of the `lake_data`
    
    # Sleep for 1 second before proceeding to the next link
    time.sleep(1)
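
One way the loop could be fleshed out is sketched below. It assumes `lake_data` is a list of dictionaries with an `"href"` key, and that links to missing Wikipedia pages contain `redlink=1`; both are assumptions to verify against your own data. The `crawl` function and its `fetch` and `extract` parameters are hypothetical names, and the stand-ins at the bottom keep the sketch off the network.

```python
import time
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/List_of_lakes_in_California"


def crawl(lake_data, fetch, extract, delay_seconds=1.0):
    """Add detail fields to each lake dict, skipping lakes without a usable link."""
    for lake in lake_data:
        href = lake.get("href")
        # Assumption: lakes without a page have no link or a "redlink=1" link
        if not href or "redlink=1" in href:
            continue
        # Links such as "/wiki/Example_Lake" are relative, so resolve them
        # against the URL of the page they came from
        url = urljoin(BASE_URL, href)
        try:
            lake.update(extract(fetch(url)))
        except Exception as error:
            # Record the failure and keep going rather than losing the whole crawl
            lake["error"] = str(error)
        # Pause before the next request
        time.sleep(delay_seconds)
    return lake_data


# Stand-ins for real fetching and parsing
lakes = [
    {"name": "Example Lake", "href": "/wiki/Example_Lake"},
    {"name": "Unlinked Pond", "href": None},
]
crawl(lakes, fetch=lambda url: url, extract=lambda html: {"source": html}, delay_seconds=0)
print(lakes)
```

Passing `fetch` and `extract` in as functions lets you test the loop on a couple of fake lakes before pointing it at the real site.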

### 3.4 Saving your data

Now we have a dataset! We can work with these data directly in our Python session, but often we will want to save our results to disk so that we can either restore them in a future Python session or use them in another tool, such as R.

To do this we will use Python's built-in [`csv`](https://docs.python.org/3/library/csv.html) module.

**ASIDE** Often for working with rectangular data in Python people use a library called [`pandas`](https://pandas.pydata.org/). `Pandas` is outside the scope of this workshop. My personal workflow is usually to save scraped data as a CSV and then to load the CSV in R to work with the data. I don't usually use `pandas`; it does, however, provide convenient functions for reading and writing CSVs. We encourage you to look into `pandas` after completing this workshop if you are interested.

#### 3.4.1 How to use `csv`

The [`csv`](https://docs.python.org/3/library/csv.html) module normally writes each row as a simple list of values:



In [None]:
import csv

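# Opening the file with newline="" is recommended by the csv docs; it avoids blank lines on Windows
with open("test.csv", "w", newline="") as fh:
    # Create a CSV writer object targeting the output file
    writer = csv.writer(fh)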
    # Write a header row
    writer.writerow(["a", "b", "c"])
    # Write a value row
    writer.writerow([1, 2, 3])

It can also write dictionaries:

In [None]:
with open("test.csv", "w", newline="") as fh:
    # Create a DictWriter object, specifying the output file and the order of field names
    writer = csv.DictWriter(fh, ["a", "b", "c"])
    writer.writeheader()
    writer.writerow({"a": 1, "b": 2, "c": 3})
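
Scraped rows are often incomplete, so here is a small sketch of how `DictWriter` might handle a list of dictionaries in which some keys are missing. The rows and field names are made up for illustration, and `io.StringIO` stands in for a real file so you can see the output directly; `restval` and `writerows` are part of the `csv` module.

```python
import csv
import io

# Made-up rows; the second one is missing the "max_depth" key
rows = [
    {"name": "Example Lake", "county": "Alpine", "max_depth": "30 m"},
    {"name": "Unlinked Pond", "county": "Mono"},
]

# io.StringIO stands in for a file opened with open(..., "w", newline="")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "county", "max_depth"], restval="")
writer.writeheader()
# writerows() writes one CSV row per dictionary; restval fills in missing keys
writer.writerows(rows)

print(buffer.getvalue())
```

Without `restval`, missing keys are still written as empty strings by default; spelling it out makes the choice visible and easy to change (say, to `"NA"` for R).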

#### 3.4.2 Exercise

Save the lake data to a CSV file.

In [None]:
with open("lake_data.csv", "w", newline="") as fh:
    # Write a header and all the value rows.
    ...

## 4. Working with scraped data

Well, here we are with some time to kill and a bunch of lake data. What do we do?

1. What's the aggregate water volume for all reservoirs in California? (Hint: what, if anything, do you have to change about your extraction technique to make this calculation possible?)
2. What's the northernmost natural lake in California? (Southernmost? Easternmost? Westernmost?)
3. What county has the largest water surface area? Smallest? (What's unfair about this question, and how might you remedy that? How could you use scraping to help?)
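
As a nudge on the hint in question 1: scraped values arrive as text, so they need converting before you can add them up. Below is a minimal sketch of one way to pull a number out of a string. The `parse_number` helper and the sample strings are hypothetical, and the sketch ignores units entirely, so it only makes sense when every value in a column uses the same unit.

```python
import re


def parse_number(text):
    """Return the first number in text as a float, or None if there is none."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return float(match.group().replace(",", ""))


# Made-up scraped strings
volumes = ["1,234.5 acre·ft", "unknown", "800 acre·ft"]
parsed = [parse_number(v) for v in volumes]

# Skip the values that could not be parsed when summing
total = sum(p for p in parsed if p is not None)
print(parsed, total)
```

Deciding what to do with the values that fail to parse (skip them, flag them, fix them by hand) is part of the analysis, not just the plumbing.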


In [None]:
# Find the answers!

## 5. Bonus

Have extra time? The web is your oyster. Find another site to scrape and have at it!

Some ideas:
 - [Wikipedia's List of Lists of Lists](https://en.wikipedia.org/wiki/List_of_lists_of_lists)
 - [San Francisco tree database](https://www.fuf.net/resources-reference/urban-tree-species-directory/)
 - _TODO(jnu): add more_
 
## 6. Next steps
 
You now have the basic ability to turn the web into data. There are a number of complicated circumstances you might still encounter. Here are some common situations that arise, and some ideas for how you might begin to navigate them:

_TODO(jnu): add resources_
 
1. What if you need to execute JavaScript? JavaScript can mutate the HTML of a page in ways that are not visible to the `requests` library.
2. What if you need to manipulate form elements? E.g., to execute a search query to get the list you want to scrape.
3. What if you need to log in to the website before you can see the page you want to scrape?
4. What if the data you want to scrape is largely unstructured free text?
5. What if you hit a CAPTCHA, even though the website doesn't forbid scraping in its terms?
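
As a small starting point for item 2: many simple search forms just send their fields as URL query parameters, which `requests` can build for you through its `params` argument. The sketch below uses a hypothetical endpoint and hypothetical parameter names, and it only prepares the request to show the resulting URL rather than sending anything. Forms that submit with POST, or pages that depend on JavaScript, need other techniques.

```python
import requests

# Hypothetical search endpoint and parameter names, for illustration only
request = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "lakes in California", "page": 2},
)
# prepare() builds the final URL without sending anything over the network
prepared = request.prepare()
print(prepared.url)
```

To find the real parameter names for a site, submit the form in your browser and look at the URL it takes you to.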