# Web Scraping with Python

Now that we're experts in (the basics of) Python, it's time to head out to the real world and test our Python mettle.

One problem that comes up over and over in research applications is that a website has the exact data we need for our work, but it doesn't provide an API for us to access the data cleanly. Instead, the data may sit in a table on the page, or be spread out so that we have to click on links and jump around the site to see everything we need. As a result, we end up copying data by hand, a tedious, time-consuming, and error-prone endeavor.

## 0. The Problem

Take for example this list on Wikipedia of named lakes in California: https://en.wikipedia.org/wiki/List_of_lakes_in_California

Let's pretend we need such a list for a project we're working on. Specifically, we want a list of lakes and information about them, including their geographic coordinates, surface area, maximum depth, and water volume. It would be a real pain to copy every entry in that table. Moreover, we cannot see details such as coordinates, water volume, and maximum depth without clicking on each of the links.

Rather than copying all these data by hand, we will use Python to automate the data collection process, to produce a tidy CSV we can use in our research.

## 1. A word of caution

Web scraping falls in a murky legal area, but courts appear to be [looking favorably](https://www.eff.org/deeplinks/2018/04/dc-court-accessing-public-information-not-computer-crime) on the practice.

First and foremost, you should always make sure the data are available for you to use in your work. Sometimes the incantations of a website's terms & conditions will prohibit this entirely. For example, an exercise we could have done is to collect snowfall data for California ski resorts from [OnTheSnow](https://www.onthesnow.com). However, [that website's terms and conditions](https://www.onthesnow.com/terms) explicitly prohibit automated---and even manual!---data collection. 

Even when a website does not explicitly prohibit scraping---and even if courts decide terms prohibiting scraping are meaningless---you should always be considerate when collecting data. When you open a website, there is a piece of software running on a physical computer somewhere that sends you the content you've requested. These resources are **not free and unlimited**: someone is paying to run this infrastructure, and it can only handle a limited number of requests at a time. When you automate web requests, it's easy to make thousands of requests very quickly. At worst, [this can overwhelm the website and bring it down](https://en.wikipedia.org/wiki/Denial-of-service_attack), making it unavailable to anyone who wants to access it. In many cases, the server will detect that you are a robot and block you. (Note also that when you are on a shared network such as Stanford's, the server's block may affect your peers as well!)

### 1.1 How do I respect the server?

Generally this just means rate-limiting your requests. In Python, we can use the `time.sleep` function to pause the script for a specified number of seconds (say, 1 second) between requests.
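As a rough sketch of what that pause looks like in practice, the loop below sleeps between consecutive requests. The helper name `polite_map`, the `delay_seconds` parameter, and the stand-in `fake_fetch` function are illustrative assumptions, not part of any library; in a real script, the fetch step would be an actual HTTP request.

```python
import time


def polite_map(fetch, urls, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls.

    `fetch` is any function that takes a URL and returns a result.
    (Hypothetical helper, written for illustration only.)
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request after the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP request, so this sketch runs offline."""
    return f"content of {url}"


pages = polite_map(
    fake_fetch,
    ["https://example.com/a", "https://example.com/b"],
    delay_seconds=0.1,
)
print(pages)
```

A one-second delay is a reasonable starting point for a small job; if the server starts refusing requests, a longer pause is the first thing to try.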

### 1.2 What else should I fear?

<img src="img/recaptcha.png" width=300 />

This is a reCAPTCHA. Google provides this service to websites that wish to discourage automated access. There are a lot of details about reCAPTCHA, but the bottom line is: if you start seeing them pop up, you've been busted as a web scraper and need to throttle your script more. You may be out of luck for scraping your target website until Google lets you out of its digital doghouse.

## 2. Getting started

Now let's dive in.

First we'll install some new modules:

 - [`requests`](https://2.python-requests.org/en/master/) - For all of the human-centeredness of Python, the standard library `urllib` is a real pain to make web requests with. Instead we'll use this third-party module that makes HTTP requests dead simple.
 - [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/) - This is a powerful module that parses HTML, the markup language used to display websites. It is installed as `beautifulsoup4` and imported as `bs4`. It deals gracefully with all sorts of messy situations you will encounter in the wild, where websites are written improperly. Generally you can trust it to do the right thing, and you don't need to know the details of what that means.

In [1]:
import sys

!{sys.executable} -m pip install requests beautifulsoup4

Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/1a/b7/34eec2fe5a49718944e215fde81288eec1fa04638aa3fb57c1c6cd0f98c3/beautifulsoup4-4.8.0-py3-none-any.whl (97kB)
Collecting soupsieve>=1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/0b/44/0474f2207fdd601bb25787671c81076333d2c80e6f97e92790f8887cf682/soupsieve-1.9.3-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.0 soupsieve-1.9.3


In [2]:
!{sys.executable} --version

Python 3.7.1
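If you want to double-check that the install worked before moving on, one option is to ask Python whether it can find the modules. This is only a sketch: `missing_modules` is a hypothetical helper name, and the check uses the standard library's `importlib.util.find_spec`, which looks a module up without importing it.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found on the import path.

    (Hypothetical helper, written for illustration only.)
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# Note that the beautifulsoup4 package is imported under the name "bs4".
missing = missing_modules(["requests", "bs4"])
print("Missing modules:", missing if missing else "none")
```

If anything is reported as missing, rerunning the `pip install` cell above in the same notebook kernel is the usual fix.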
