In [None]:
from IPython.core.display import HTML
# Most of this lecture is stolen from Cary, a little from Eric and some other scattered pieces

# What is web scraping?

The practice of gathering data through any means other than a program interacting with an API.

# Why do we web scrape?

Browsers are good for many things, but not so much for gathering information.

- Sometimes the data we want isn't available packaged all nicely
- Sometimes the company doesn't want to make it easy for us to get all their data
- Sometimes the data source we need isn't big enough to warrant that company creating an API

# How do we do it?

If you can view it in your browser, you can grab it with a python script!

We'll discuss what packages you need when we get there.

# Is there an easier way?

APIs are a thing. When a site offers one that gives us what we need, using it is usually easier than scraping.

# What are the keys to a good scraper?

Some things to keep in mind when building your scraper:

- Save all the data you collect
- Don't abuse `try`/`except` blocks
- Don't get banned from the site you're interested in
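
A minimal sketch of how those three points might look in code. The names `scrape_all`, `fetch`, `raw_pages` and `failed` are made up for illustration; `fetch` stands in for whatever downloads a page (for example a small wrapper around `requests.get`).

```python
import time

def scrape_all(urls, fetch, delay=1.0):
    """Fetch each URL politely, keep every raw page, and record failures.

    `fetch` is a hypothetical function that takes a URL and returns its HTML
    as a string; in practice it might wrap requests.get.
    """
    raw_pages = {}   # save everything you collect; parse it later
    failed = []
    for url in urls:
        try:
            raw_pages[url] = fetch(url)
        except IOError as err:   # catch only the errors you expect, never a bare except
            failed.append((url, str(err)))
        time.sleep(delay)        # pause between requests so the site has no reason to ban you
    return raw_pages, failed
```

Keeping the raw pages means a parsing bug found later can be fixed without hitting the site again.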

## Afternoon Objectives

1. Understand the process of getting data from the web.
2. Know the basics of HTML/CSS:
    * Know how to pull desired data from web pages.
3. Be able to use existing APIs to fetch pre-formatted data.

### Internet vs. World Wide Web

* The internet is commonly referred to as a network of networks. It is the infrastructure that allows networks all around the world to connect with one another. There are many different protocols to transfer information within this larger, meta-network.
* The World Wide Web, or Web, provides one of the ways that data can be transferred over the internet. It uses a **U**niform **R**esource **L**ocator, URL, to specify the location, within the internet, of a document.
* Documents on the web are generally written in **H**yper**T**ext **M**arkup **L**anguage, HTML, which can be natively viewed by browsers, the tool that we use to browse the web.
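
To make the parts of a URL concrete, here is a small sketch that splits one of the API URLs used later in this lesson into its pieces. It uses the Python 3 standard library (`urllib.parse`); nothing is sent over the network.

```python
from urllib.parse import urlparse, parse_qs  # Python 3 location of these helpers

url = 'https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2013-01'
parts = urlparse(url)

print(parts.scheme)            # the protocol used to reach the document
print(parts.netloc)            # the host that serves it
print(parts.path)              # where the document lives on that host
print(parse_qs(parts.query))   # optional parameters, as a dict of lists
```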

### Communication on the Web

Information is transmitted around the web through a number of protocols. The main one that you will see is the **H**yper**T**ext **T**ransfer **P**rotocol, HTTP. These transfers, called **requests**, are initiated in a number of ways, but always begin with the client, read: you at your browser.

There are 4 main types of request that can be issued by your browser: get, post, put and delete. For web scraping purposes, you will almost always be using get requests. We will learn some more about the others in a couple of weeks during data products day.
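
For a feel of what a get request actually is, the sketch below builds the text a client would send for a minimal HTTP/1.1 request. The helper name `build_get_request` is made up for illustration; in practice a library such as `requests` writes this for you.

```python
def build_get_request(host, path='/'):
    """Return the text a client sends for a minimal HTTP/1.1 GET request."""
    lines = [
        'GET {} HTTP/1.1'.format(path),   # method, resource, protocol version
        'Host: {}'.format(host),          # required header in HTTP/1.1
        'Connection: close',              # ask the server to close the connection afterwards
        '',                               # a blank line ends the headers
        '',
    ]
    return '\r\n'.join(lines)

print(build_get_request('sf.funcheap.com', '/2016/06/25/'))
```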

### HTML & CSS
HTML, or *HyperText Markup Language*, is the language that virtually all web pages are written in.  HTML allows us to describe how web pages should render themselves and display content.  It does this with *HTML tags*, where each tag describes a different kind of document content.  For example we could specify the title of our page with the following tags: `<title>This is the Title!</title>`.  `<title>` is the opening tag and `</title>` is the corresponding closing tag.  Anything that falls in the middle will be interpreted as the title of the document.  For a more in-depth introduction to HTML, please refer to [this w3schools Tutorial](http://www.w3schools.com/html/html_intro.asp).
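
As a rough illustration of opening tags, closing tags and the content between them, the sketch below walks a tiny document with the Python 3 standard-library `HTMLParser` and collects whatever sits inside `<title>`. The class name `TitleParser` is made up for this example.

```python
from html.parser import HTMLParser  # Python 3 standard library

class TitleParser(HTMLParser):
    """Collect the text between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':          # opening tag seen
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':          # closing tag seen
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:           # text that falls in the middle
            self.title += data

parser = TitleParser()
parser.feed('<html><head><title>This is the Title!</title></head><body></body></html>')
print(parser.title)
```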

We can also apply *attributes* to HTML tags to control how they will be stylized.  The `<title>` itself is shown in the browser tab rather than on the page, so let's style a heading instead.  Say we would like to make a top-level heading red so that it pops.  We could easily achieve this with the following style attribute: `<h1 style="color:red;">This is a red heading!</h1>`.  But what if we have a very large website and want to change the look and feel of the entire thing?  Certainly there has to be a better way than manually manipulating individual tags?  

This can easily be achieved with the use of *Cascading Style Sheets*, or CSS.  Typically CSS is defined in a separate style sheet file with a `.css` extension which controls how certain tags are stylized.  For example, the following CSS block would make any `<p>` tags have red text and be in the courier font:

```css
p  {
    color: red;
    font-family: courier;
}
```

This is all well and good, but our goal isn't to build websites, it's to systematically extract information from them!  To this end, CSS selectors provide a succinct way of only selecting certain aspects from a potentially giant block of HTML.

#### CSS Selectors
To really see the power of CSS selectors we need to see them in action!  To that end, try getting through the first 10-15 levels of the [Game of Fruit](http://flukeout.github.io/).

You can also refer to this [CSS Selector Cheatsheet](http://www.cheetyr.com/css-selectors)
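
Here is a small sketch of selectors used for extraction rather than styling, using BeautifulSoup's `select` method (the `bs4` package is introduced properly below). The HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

# A small made-up page, just for illustration
html = """
<html><body>
  <h2 class="title">Events</h2>
  <div id="listing">
    <p class="event">Free concert</p>
    <p class="event paid">Film night</p>
    <p>Not an event</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

by_tag = [p.get_text() for p in soup.select('p')]                # every <p> tag
by_class = [p.get_text() for p in soup.select('p.event')]        # <p> tags with class "event"
by_child = [p.get_text() for p in soup.select('#listing > p.paid')]  # paid events directly inside #listing

print(by_tag)
print(by_class)
print(by_child)
```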

# Scraping from a Web Page with Python

Scraping a web site basically comes down to making a request from Python and parsing through the HTML that is returned from each page. For each of these tasks we have a Python library, `requests` and `bs4`, respectively.

### Requests Library

The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making HTTP requests within Python. The interface is mindbogglingly simple. Call the function for the type of request you want, which will mostly be `get`, with the URL and any optional parameters you'd like passed along with the request. The response object that comes back makes the results of the request available via attributes and methods.

In [None]:
import requests
fun_cheap = 'http://sf.funcheap.com'
r = requests.get(fun_cheap + '/2016/06/25/')

In [None]:
r.text[:1000] # First 1000 characters of the HTML

### Getting Info from a Web Page with BeautifulSoup

Now that we can gain easy access to the HTML for a web page, we need some way to pull the desired content from it. Luckily there is already a system in place to do this. With a combination of HTML and CSS selectors we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

“Beautiful Soup, so rich and green,

Waiting in a hot tureen!

Who for such dainties would not stoop?

Soup of the evening, beautiful Soup!”

--Lewis Carroll

In [None]:
import requests
import re
from bs4 import BeautifulSoup

In [None]:
r = requests.get('http://www.espn.com/college-football/statistics/player/_/stat/passing/sort/passingYards/year/2015/qualified/false/count/1')

In [None]:
bs_obj = BeautifulSoup(r.text, 'html.parser')

In [None]:
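# Pitfall: a dict cannot hold two 'class' keys, so only 'oddrow' survives and the even rows are missed
bs_obj.findAll('tr',{'class':'evenrow','class':'oddrow'})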

In [None]:
bs_obj.findAll('tr',{'class':re.compile('^(evenrow|oddrow)')})
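
The difference between the two calls above can be reproduced offline. The sketch below uses a made-up table (not real ESPN markup) to show why the duplicate-key dictionary finds only odd rows, and two ways of matching both classes: the regular expression used above and a CSS selector.

```python
import re
from bs4 import BeautifulSoup

# Made-up table in the spirit of the stats page, for illustration only
html = """
<table>
  <tr class="colhead"><td>PLAYER</td></tr>
  <tr class="oddrow"><td>Player A</td></tr>
  <tr class="evenrow"><td>Player B</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# A dict keeps only the last value for a repeated key
attrs = {'class': 'evenrow', 'class': 'oddrow'}
print(attrs)

only_odd = soup.find_all('tr', attrs)
with_regex = soup.find_all('tr', {'class': re.compile('^(evenrow|oddrow)')})
with_selector = soup.select('tr.evenrow, tr.oddrow')

print(len(only_odd), len(with_regex), len(with_selector))
```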

In [None]:
for obj in bs_obj.findAll('tr',{'class':re.compile('^(evenrow|oddrow)')}):
    print(obj.find('span')['title'])

In [None]:
for obj in bs_obj.findAll('tr',{'class':re.compile('^(evenrow|oddrow)')}):
    html = requests.get(obj.find('a')['href'])
    bs_obj2 = BeautifulSoup(html.text, 'html.parser')
    name = bs_obj2.h1.get_text()
    born = bs_obj2.find('ul',{'class':'player-metadata'}).findAll('li')[0].get_text()
    if born[0] == 'B':
        print('{} Born: {}'.format(name,born[4:]))

# Basic Authentication

In [None]:
import requests
z = requests.get('http://galvanizesf.roomzilla.net')
z

In [None]:
import requests
z = requests.get('http://galvanizesf.roomzilla.net', auth=('', 'gVIP543'))
z

In [None]:
HTML(z.content)
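
For a sense of what the `auth=` tuple does: HTTP basic authentication joins the user name and password with a colon, base64-encodes the result, and sends it in an `Authorization` header. The sketch below builds that header value by hand; the helper name and the credentials are hypothetical.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header value for HTTP basic authentication."""
    credentials = '{}:{}'.format(username, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')

# Hypothetical credentials, for illustration only
header = basic_auth_header('user', 'secret')
print(header)
```

Base64 is an encoding, not encryption, so anyone who can see the header can recover the password.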

Sometimes they hide the info from us, but we can find it by being sneaky...
http://m.mlb.com/scoreboard#date=11/1/2016

As you go through a web site you should build up a dictionary for the documents that you want to store in Mongo. In the example above we may, for each post url, create a dictionary with the information:
```python
    { 'url': url_of_event,
      'date': date_event,
      'cost': cost_of_event }
```

We can then insert these dictionaries into a Mongo database via PyMongo, which we will learn about next.

# Scraping from an Existing API

Let's take a look at the API for all the publicly available policing data in the [UK](https://data.police.uk/docs/). After taking a look at the documentation for the interface, let's experiment with what we get when we issue a request to this API. The process looks remarkably similar to the one we went through for scraping a web page, except this time the response we're looking for is available via the `json()` method.

In [None]:
r = requests.get('https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2013-01')
r.json()[:2]
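
Rather than typing the query string by hand, the parameters can be passed as a dictionary and `requests` will encode them. The sketch below prepares the same request without sending it, only to show the URL that would be used; the variable names are illustrative.

```python
import requests

base_url = 'https://data.police.uk/api/crimes-street/all-crime'
params = {'lat': '52.629729', 'lng': '-1.131592', 'date': '2013-01'}

# Build the request without sending it, just to inspect the final URL
prepared = requests.Request('GET', base_url, params=params).prepare()
print(prepared.url)

# To actually send it: r = requests.get(base_url, params=params)
```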

## API Scraping and Mongo

Many APIs give you a choice of formats for the data they return. Choosing JSON will make life easier, since we will frequently be using Mongo for storage during our scraping endeavors and it plays very well with JSON.

Interacting with Mongo from Python is done with the other Mongo client that we talked about earlier, PyMongo. It is designed to have an interface similar to the Mongo shell's. This ends up being fairly intuitive, since both Python and JavaScript are object-oriented languages and therefore store and refer to things in a similar manner.
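
To see why JSON and Mongo fit together, note that a JSON array of objects becomes a Python list of dictionaries, which is the shape PyMongo inserts and returns. The sketch below uses a made-up sample string (not real API output) and a plain-Python filter that mirrors a Mongo query document such as `{'category': 'public-order'}`.

```python
import json

# Made-up sample in the general shape of an API response (not real data)
raw = '[{"category": "public-order", "month": "2013-09"}, {"category": "burglary", "month": "2013-09"}]'

docs = json.loads(raw)   # JSON array of objects -> Python list of dicts
print(type(docs), type(docs[0]))

# Plain-Python equivalent of querying on {'category': 'public-order'}
public_order = [d for d in docs if d['category'] == 'public-order']
print(public_order)
```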

In [None]:
from pymongo import MongoClient

client = MongoClient()
db = client.uk_police
collection = db.all_crime

In [None]:
other_request = requests.get('https://data.police.uk/api/crimes-no-location?category=all-crime&force=warwickshire&date=2013-09')

In [None]:
other_request.json()[:2]

In [None]:
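# Possible way to grab data for a range of months and years
url = 'https://data.police.uk/api/crimes-no-location'
for year in range(2001, 2016):
    for month in range(1, 13):
        date = '{}-{:02d}'.format(year, month)  # the API expects YYYY-MM, e.g. 2013-09
        r = requests.get(url, params={'category': 'all-crime', 'force': 'warwickshire', 'date': date})
        if r.ok and r.json():  # skip failed requests and empty months (insert_many rejects an empty list)
            collection.insert_many(r.json())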

In [None]:
collection.insert_many(other_request.json())

In [None]:
import pprint as pp
for item in collection.find({ 'category' : 'public-order' }):
    pp.pprint(item)

In [None]:
# Remember to close the connection
client.close()

# Wikipedia API

In [None]:
# import the Requests HTTP library
import requests
import json
import re

# A User agent header required for the Wikipedia API.
headers = {'User-Agent': 'Web_Scraping/1.1 (darren.reger@galvanize.com; dsi example exercise)'}

In [None]:
# Experiment with fetching one or two pages and examining the result (fill in URL and payload)
url = 'https://en.wikipedia.org/w/api.php'

# parameters for the API request
payload = { 'action' : 'parse' , 'format' : 'json','page' : 'Kevin Bacon' }

# make the request
r = requests.post(url, data=payload, headers=headers)

# print out the result of the request as JSON
print(r.json()['parse'])

# this is the same as going to https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Kevin%20Bacon

In [None]:
print(r.text)
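
The comment in the request cell says the call is equivalent to visiting a URL with the parameters in the query string. As a quick check, the sketch below builds that URL from the same `url` and `payload` using the Python 3 standard library, without sending anything.

```python
from urllib.parse import urlencode, quote  # Python 3 standard library

url = 'https://en.wikipedia.org/w/api.php'
payload = {'action': 'parse', 'format': 'json', 'page': 'Kevin Bacon'}

# quote_via=quote encodes the space as %20 (the default would use +)
full_url = url + '?' + urlencode(payload, quote_via=quote)
print(full_url)
```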

# Downloading an image or file
We use `urlretrieve`, which downloads a URL straight to a local file.

In [None]:
from urllib.request import urlretrieve  # Python 3; on Python 2 use: from urllib import urlretrieve

In [None]:
urlretrieve('https://images.craigslist.org/00Y0Y_kho6IzfhxVn_600x450.jpg', filename='cltest.jpg')
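
To try `urlretrieve` without touching the network, the sketch below points it at a `file://` URL for a temporary local file that stands in for a remote image. It uses the Python 3 import location; the file names are made up.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve  # Python 3 location

# Create a small local file to stand in for a remote image
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'source.bin')
with open(source, 'wb') as f:
    f.write(b'pretend image bytes')

# urlretrieve copies the URL's content to `filename` and returns (path, headers)
target = os.path.join(workdir, 'copy.bin')
saved_path, headers = urlretrieve(Path(source).as_uri(), filename=target)
print(saved_path)
```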

## Leftover from Cary's Lecture in case the ESPN live demo goes poorly

In [None]:
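from bs4 import BeautifulSoup
# `r` has been reassigned since the first example; this assumes the cells below expect the funcheap page
r = requests.get('http://sf.funcheap.com/2016/06/25/')
soup = BeautifulSoup(r.text, 'html.parser')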

In [None]:
soup.find_all('a', rel=True)[:10]

In [None]:
soup.select('h2.title')

In [None]:
title = soup.find_all('h2', class_='title')[0]

In [None]:
good_clear_float = title.next_sibling.next_sibling

In [None]:
urls = []
for tag in good_clear_float.find_all('a', rel=True):
    urls.append(tag.attrs['href'])

In [None]:
urls