# Web Scraping with Python

Note: I've made a Scraper class object for this tutorial. You can find it in the Scraper.py file.

### Write raw html

If you're writing and debugging code, store the data locally. There are two good reasons:
1. Sending the same request to a server dozens or possibly hundreds of times puts unnecessary burden on their systems.
2. It's faster to pull data from a local file.

Always write encoded bytes. If you don't yet know why, [check out Ned Batchelder's PyCon talk](https://www.youtube.com/watch?v=sgHbC6udIqc). Watch every second of it. 

In [1]:
import random
import requests

url = "http://www.sports-reference.com/cbb/boxscores/"
# "../" denotes the directory directly above the current working directory.
path = './../html/today\'s box scores.txt'

# Make ID line
ID = hex(random.randrange(16**30))
ID_line = '<!-- ID: {} -->\n'.format(ID).encode('utf-8')

# Make request
r = requests.get(url)

# We're joining byte strings, so the separator must be a bytes literal: b''
html = b''.join((ID_line, r.content))

with open(path, 'wb') as ofile:
    ofile.write(html)
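
To make the "store it locally" habit automatic, here is a minimal sketch of a read-through cache. The helper name `load_or_fetch` and its `fetch` argument are assumptions made for illustration; they are not part of `Scraper.py`.

```python
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the bytes stored at `path`; call `fetch()` only if the file is missing."""
    path = Path(path)
    if path.exists():
        return path.read_bytes()  # cache hit: no request is sent
    data = fetch()  # cache miss: ask the server once
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)  # store the encoded bytes for next time
    return data
```

With the variables from the cell above, usage could look like `html = load_or_fetch(path, lambda: requests.get(url).content)`. Rerunning the cell then reads from disk instead of hitting the server again.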

### Bonus Tip 1

`r.content` returns encoded bytes (print `r.content` and notice the "b" before "<!doctype html>...")  
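`r.text` returns a decoded `str`, using the encoding `requests` infers for the response (check `r.encoding`; it is often, but not always, UTF-8)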
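
The same distinction can be seen without sending a request. This small sketch uses a plain string as a stand-in for a response body, so `raw` and `text` only play the roles of `r.content` and `r.text`.

```python
# Encoded bytes, like r.content
raw = "<title>Nick’s Workshop</title>".encode("utf-8")
# Decoded text, like r.text
text = raw.decode("utf-8")

print(type(raw).__name__, len(raw))    # bytes 32
print(type(text).__name__, len(text))  # str 30
```

The lengths differ because the curly apostrophe is one character but three bytes in UTF-8.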

### Bonus Tip 2
You can control more of what a request sends by creating a `requests` `Session` object. You'll want to change the headers if you're scraping more sophisticated webpages. Copy the headers used through your favorite VPN and use them with your session object! But use caution...

```
session = requests.Session()
session.headers = {  # These are the default headers, for example
    'Accept-Encoding': 'gzip, deflate', 
    'User-Agent': 'python-requests/2.10.0', 
    'Connection': 'keep-alive', 
    'Accept': '*/*'
}
r = session.get(url)

```
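
If you want to check which headers a session will send before bothering the server, one option is to prepare the request without sending it. This is a sketch; the `User-Agent` value is a made-up placeholder.

```python
import requests

session = requests.Session()
# update() merges into the default headers instead of replacing them.
# The User-Agent string is a made-up placeholder.
session.headers.update({"User-Agent": "workshop-scraper/0.1"})

# Build the request without sending it, then inspect the merged headers.
request = requests.Request("GET", "http://www.sports-reference.com/cbb/boxscores/")
prepared = session.prepare_request(request)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Nothing goes over the network here, so you can tweak the headers as often as you like.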

---
# Make the crawler _crawl_

Note: We will cover `BeautifulSoup` more deeply in the data mining tutorial.

We often want to follow internal links on webpages to get to more interesting data. This is _crawling_. To do this, we'll want to use `BeautifulSoup` to help us navigate HTML documents.

If you get confused along the way, I encourage you to take [Codecademy's HTML Basics course](https://www.codecademy.com/courses/web-beginner-en-HZA3b). It's short and explains things better than I can in this workshop.

Here's the plan of attack:

1. Identify links in a webpage.
2. Follow those links.

### Quick Intro to BeautifulSoup

In [2]:
from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html>
    <head>
        <title>Nick’s Workshop</title>
    </head>
    <body>
        <h1>
            <a href="/workshop">
            Nick’s Workshop
            </a>
        </h1>
        <h1>Welcome!</h1>
    </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
soup


<!DOCTYPE html>

<html>
<head>
<title>Nick’s Workshop</title>
</head>
<body>
<h1>
<a href="/workshop">
            Nick’s Workshop
            </a>
</h1>
<h1>Welcome!</h1>
</body>
</html>

We can navigate the tree using HTML tags in one of two ways:

1. BeautifulSoup attributes
2. BeautifulSoup `find` and `find_all` methods

In [3]:
print("Using attributes:\n",      soup.h1.a)
print("Using the find method:\n", soup.find('h1').find('a'))

Using attributes:
 <a href="/workshop">
            Nick’s Workshop
            </a>
Using the find method:
 <a href="/workshop">
            Nick’s Workshop
            </a>


Note that both approaches return only the first match. To find all matches:

In [4]:
print("Using the find_all method:", soup.find_all('h1'))

Using the find_all method: [<h1>
<a href="/workshop">
            Nick’s Workshop
            </a>
</h1>, <h1>Welcome!</h1>]


We found the internal link. To pull the URL out of the tag's `href` attribute, use the `get` method.

In [5]:
soup.h1.a.get('href')

'/workshop'

### Bonus Tip 3

The `find_all` method actually returns a `ResultSet` object, which behaves much like a list. This means we can't chain `find_all` calls the way we can with `find`.

In [6]:
try:
    soup.find_all('h1').find_all('a')
except Exception as err:
    print(err)

'ResultSet' object has no attribute 'find_all'
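
Since a `ResultSet` behaves like a list, one way around this is to loop over it and call `find_all` on each tag. A minimal sketch, using a trimmed copy of the workshop HTML from above:

```python
from bs4 import BeautifulSoup

html = """
<body>
    <h1><a href="/workshop">Nick’s Workshop</a></h1>
    <h1>Welcome!</h1>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Search inside each <h1> separately, then flatten the results into one list.
links = [a for h1 in soup.find_all("h1") for a in h1.find_all("a")]
hrefs = [a.get("href") for a in links]
print(hrefs)  # ['/workshop']
```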


### Bonus Tip 4 
In practice, we must further specify tags by their class, id, etc. The SelectorGadget makes easy work of this. For the sake of time, we'll save explanation of how to use it for the data mining tutorial.

**Note**: The SelectorGadget is a Chrome add-on. You can [check out the webpage and short tutorial](http://selectorgadget.com/) if you're really curious how to use it _now_. It's pretty intuitive if you're familiar with HTML classes and ids.
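
As a preview, here is a small sketch of narrowing a search by class, by id, and by CSS selector (the kind of selector SelectorGadget produces). The HTML snippet is invented for illustration and is not copied from any real page.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only.
html = """
<table>
  <tr><td class="right gamelink"><a href="/cbb/boxscores/a.html">Final</a></td></tr>
  <tr><td class="right"><a href="/cbb/schools/x/">School</a></td></tr>
</table>
<div id="footer"><a href="/about">About</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches tags that have "gamelink" among their classes.
by_class = soup.find_all("td", class_="gamelink")
# id values are unique on a page, so find() is enough.
by_id = soup.find(id="footer")
# select() takes a CSS selector string.
by_css = soup.select("td.gamelink a")

print(len(by_class), by_id.a.get("href"), by_css[0].get("href"))
```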

### Following Internal Links

We now know how to find internal links on a page. There are two things we want to do next:

1. Scrape the box scores for more game data;
2. Find scores from different dates. 

We'll switch gears now to scraping [NCAA basketball box scores on sports-reference](http://www.sports-reference.com/cbb/boxscores/index.cgi?month=02&day=03&year=2017) to demonstrate. 

Web pages contain absolute paths and relative paths. An absolute path contains the entire address to a link. A relative path contains the address relative to another base address. 

Internal links almost always contain relative paths. For example, the link to college basketball stats on [sports-reference](http://sports-reference.com) is "/cbb". The browser knows this is an internal link and resolves it against the site's root, sending the user to http://sports-reference.com/cbb. 
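
Python's standard library can do the same resolution a browser does. This sketch uses `urllib.parse.urljoin`; the variable names are just for illustration.

```python
from urllib.parse import urljoin

page_url = "http://www.sports-reference.com/cbb/boxscores/index.cgi?month=02&day=03&year=2017"

# A path starting with "/" is resolved against the site's root.
from_root = urljoin(page_url, "/cbb/boxscores/2017-02-03-ball-state.html")
# A bare file name is resolved against the current page's directory.
from_page = urljoin(page_url, "2017-02-03-ball-state.html")
# An absolute URL is returned unchanged.
absolute = urljoin(page_url, "http://sports-reference.com/cbb")

print(from_root)  # http://www.sports-reference.com/cbb/boxscores/2017-02-03-ball-state.html
print(from_page)  # http://www.sports-reference.com/cbb/boxscores/2017-02-03-ball-state.html
print(absolute)   # http://sports-reference.com/cbb
```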

To get the box score data, we must make a similar correction. 

In [7]:
# Make gamesheet soup
gamesheet_url = "http://www.sports-reference.com/cbb/boxscores/index.cgi?month=02&day=03&year=2017"
r = requests.get(gamesheet_url)
soup = BeautifulSoup(r.text, 'html.parser')

# Bonus Tip: Make ROOT_URL a global variable so you don't have to keep specifying it. 
ROOT_URL = "http://www.sports-reference.com"

# I used the SelectorGadget to learn the class "right gamelink" corresponds to box score links
for tag in soup.find_all(class_='right gamelink'):
    print(tag.a.get('href'))
#     box_score_url = ROOT_URL + tag.a.get('href')
#     print(box_score_url)

/cbb/boxscores/2017-02-03-ball-state.html
/cbb/boxscores/2017-02-03-central-michigan.html
/cbb/boxscores/2017-02-03-columbia.html
/cbb/boxscores/2017-02-03-cornell.html
/cbb/boxscores/2017-02-03-dartmouth.html
/cbb/boxscores/2017-02-03-davidson.html
/cbb/boxscores/2017-02-03-harvard.html
/cbb/boxscores/2017-02-03-monmouth.html
/cbb/boxscores/2017-02-03-rider.html


And there we go! Now we can just save the HTML for each link locally. 
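
One way to do that is sketched below. The function name `save_pages`, the `fetch` argument, and the one-second default delay are all assumptions made for this example. Passing the download step in as `fetch` keeps the loop easy to try out without touching the server.

```python
import time
from pathlib import Path, PurePosixPath
from urllib.parse import urljoin, urlsplit

ROOT_URL = "http://www.sports-reference.com"


def save_pages(hrefs, out_dir, fetch, delay=1.0):
    """Save each linked page under out_dir, pausing `delay` seconds between requests."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, href in enumerate(hrefs):
        if i > 0:
            time.sleep(delay)  # be polite: wait between requests
        url = urljoin(ROOT_URL, href)
        # Reuse the last part of the URL path as the local file name.
        filename = PurePosixPath(urlsplit(url).path).name
        target = out_dir / filename
        target.write_bytes(fetch(url))  # fetch must return encoded bytes
        saved.append(target)
    return saved
```

With `requests`, a call could look like `save_pages(hrefs, './../html', lambda u: requests.get(u).content)`, where `hrefs` is the list of relative links printed above.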

### Bonus Tip 5

We may be tempted to navigate to other gamesheets the same way. But notice the URL takes date parameters.

> http://www.sports-reference.com/cbb/boxscores/index.cgi?month=01&day=22&year=2017 

Instead of following links between gamesheets, we can build the URL for the date we want and scrape that page directly. This solves the additional problem of navigating to other dates. It will also make the code _much_ easier to read and debug later on.

### Bonus Tip 6

URL parameters rarely require a specific order. To make it more legible, we can flip the order of month, day, year to a less ambiguous format.

In [8]:
url = "http://www.sports-reference.com/cbb/boxscores/index.cgi?month={}&day={}&year={}"
print('American Format:     ', url.format(1,1,2017))

# we can even flip the parameters around to make our url less ambiguous for unamerican communists.
url = "http://www.sports-reference.com/cbb/boxscores/index.cgi?year={year}&month={month}&day={day}"
print("International Format:", url.format(year=2017,month=1,day=27))

American Format:      http://www.sports-reference.com/cbb/boxscores/index.cgi?month=1&day=1&year=2017
International Format: http://www.sports-reference.com/cbb/boxscores/index.cgi?year=2017&month=1&day=27
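
Putting Bonus Tips 5 and 6 together, here is a sketch that builds the gamesheet URL for every day in a date range. The function names are made up for this example, and `urlencode` is used in place of `str.format` to assemble the query string.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = "http://www.sports-reference.com/cbb/boxscores/index.cgi"


def gamesheet_url(day):
    """Build the gamesheet URL for one date, with parameters in year-month-day order."""
    query = urlencode({"year": day.year, "month": day.month, "day": day.day})
    return BASE_URL + "?" + query


def gamesheet_urls(start, end):
    """Build one URL per day from start to end, inclusive."""
    n_days = (end - start).days
    return [gamesheet_url(start + timedelta(days=n)) for n in range(n_days + 1)]


for url in gamesheet_urls(date(2017, 1, 30), date(2017, 2, 2)):
    print(url)
```

Working with `date` objects means month and year rollovers are handled for us.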


# Scraper Architecture

It'll be helpful to package our scraper as a class so we can reuse it across different URLs.

Consider this docstring from the `Scraper` class in the `Scraper.py` file. 

```
Scraper:
    
    Attributes:
        _crawl_delay (int) -- Crawl delay to implement between requests.
        _encoding (str) -- Which encoding to use when writing html to disk. (default 'utf-8')
        _default_headers (dict) -- Default headers when not using VPN.
        _VPN_headers (dict) -- Headers to be used when using VPN.
    
    Methods:
        __init__ (None) -- Initialize scraper.
        set_crawl_delay (None) -- Set crawl delay to a positive int.
        make_soup (BeautifulSoup object) -- Make soup out of given url.
        write_html (None) -- Write given url to disc at specified path. 

    Todo:
        Give option for randomized crawl delay. Adds protection against bot detectors.
        Consider new procedures for different encodings. Converting r.content to soup for utf-8 is wasteful.
```

A scraper with the above structure allows the user to access all of its functionality at a glance. If we need to change anything about the scraper, we need only change one file - not two, three, four, or several. 
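
To give a feel for the crawl delay and the randomized-delay item in the Todo, here is a rough sketch. It is not the code in `Scraper.py`. The class name `DelayedSession`, the `jitter` parameter, and the injectable `sleep` are all assumptions made for illustration.

```python
import random
import time


class DelayedSession:
    """Hypothetical wrapper that waits between requests made through a session."""

    def __init__(self, session, crawl_delay=1, jitter=0.0, sleep=time.sleep):
        self._session = session  # anything with a get(url) method
        self._jitter = jitter    # maximum extra random wait, in seconds
        self._sleep = sleep
        self._has_requested = False
        self.set_crawl_delay(crawl_delay)

    def set_crawl_delay(self, crawl_delay):
        """Set crawl delay to a positive int."""
        if not isinstance(crawl_delay, int) or crawl_delay <= 0:
            raise ValueError("crawl delay must be a positive int")
        self._crawl_delay = crawl_delay

    def get(self, url, **kwargs):
        """Wait before every request except the first, then delegate to the session."""
        if self._has_requested:
            self._sleep(self._crawl_delay + random.uniform(0, self._jitter))
        self._has_requested = True
        return self._session.get(url, **kwargs)
```

Usage could look like `polite = DelayedSession(requests.Session(), crawl_delay=2, jitter=1.0)` followed by `polite.get(url)`. Keeping the delay logic in one class means the rest of the code never has to remember to sleep.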

We'll do the same with our `Miner` class in the data mining workshop. There we'll tackle class inheritance. I promise that's more intimidating than difficult. 

# Invitation to Edit

Please copy any of the code in this repo and use it in your own code base. Better yet, make changes! Play with the code and make improvements. That's how you learn.