## Afternoon Objectives

1. Understand the process of getting data from the web.
2. Know the basics of HTML/CSS:
    * Know how to pull desired data from web pages.
3. Be able to use existing APIs to fetch pre-formatted data.

### Internet vs. World Wide Web

* The internet is commonly referred to as a network of networks. It is the infrastructure that allows networks all around the world to connect with one another. There are many different protocols to transfer information within this larger, meta-network.
* The World Wide Web, or Web, provides one of the ways that data can be transferred over the internet. It uses a **U**niform **R**esource **L**ocator, URL, to specify the location, within the internet, of a document.

    <div style="text-align: center"><h3>Anatomy of a URL</h3><img src="images/url.png" style="width: 600px"></div>
    
* Documents on the web are generally written in **H**yper**T**ext **M**arkup **L**anguage, HTML, which can be natively viewed by browsers, the tool that we use to browse the web.
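
To make the anatomy of a URL concrete, here is a small sketch that splits one into its parts with the standard library's `urllib.parse`. The URL is the police data API address used later in these notes; nothing is sent over the network.

```python
from urllib.parse import urlparse, parse_qs

# URL taken from the API example later in these notes
url = 'https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2013-01'

parts = urlparse(url)
print(parts.scheme)   # protocol used to fetch the document
print(parts.netloc)   # host (domain) name
print(parts.path)     # location of the resource on that host
print(parts.query)    # raw query string after the '?'

# parse_qs turns the query string into a dict mapping names to lists of values
params = parse_qs(parts.query)
print(params)
```

Each value in `params` is a list because a query string may repeat the same parameter name.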

### Communication on the Web

Information is transmitted around the web through a number of protocols. The main one that you will see is the **H**yper**T**ext **T**ransfer **P**rotocol, HTTP. These transfers, called **requests**, are initiated in a number of ways, but always begin with the client, read: you at your browser.

 <div style="text-align: center"><h3>Requests in Action</h3><img src="images/requests.png" style="width: 600px"></div>
 
There are four main types of request that can be issued by your browser: GET, POST, PUT and DELETE. For web scraping purposes, you will almost always be using GET requests. We will learn some more about the others in a couple of weeks during data products day.
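
As a rough illustration of the four request types, the sketch below builds one request object of each kind without sending anything. It uses the standard library's `urllib.request` rather than the `requests` library so that it runs offline. The endpoint and the form data are made up for the example.

```python
from urllib.request import Request

# Hypothetical endpoint, used only for illustration; nothing is sent
url = 'http://example.com/events'

# Building a Request object does not contact the server
get_req = Request(url)  # with no data and no method given, this is a GET
post_req = Request(url, data=b'name=picnic', method='POST')
put_req = Request(url, data=b'name=picnic', method='PUT')
delete_req = Request(url, method='DELETE')

for req in (get_req, post_req, put_req, delete_req):
    print(req.get_method(), req.full_url)
```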

# Scraping from a Web Page with Python

Scraping a web site basically comes down to making a request from Python and parsing through the HTML that is returned from each page. For each of these tasks we have a Python library, `requests` and `bs4`, respectively.

### Requests Library

The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making HTTP requests within Python. The interface is strikingly simple: call the function named after the request type (this will mostly be `get`) with the URL and any optional parameters you'd like passed along with the request. The `Response` object that comes back makes the results of the request available via attributes and methods.


In [None]:
import requests
fun_cheap = 'http://sf.funcheap.com'
# Build the page URL from the base URL so the variable above is actually used
r = requests.get(fun_cheap + '/2016/06/25/')

In [None]:
r.text[:1000] # First 1000 characters of the HTML

### Getting Info from a Web Page

Now that we have easy access to the HTML for a web page, we need some way to pull the desired content from it. Luckily there is already a system in place to do this. With a combination of HTML and CSS selectors we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

In [None]:
soup.find_all('a', rel=True)[:10]

In [None]:
soup.select('h2.title')

In [None]:
title = soup.find_all('h2', class_='title')[0]

In [None]:
good_clear_float = title.next_sibling.next_sibling

In [None]:
urls = []
for tag in good_clear_float.find_all('a', rel=True):
    urls.append(tag.attrs['href'])
urls
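
To see conceptually what a search like `find_all('a', rel=True)` is doing, here is a rough sketch that uses only the standard library's `html.parser` on a made-up HTML fragment. This is not how BeautifulSoup is implemented, and for real scraping BeautifulSoup remains the more convenient tool. The class name `LinkCollector` and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser

# Made-up HTML fragment standing in for a downloaded page
html = '''
<h2 class="title">Top Picks</h2>
<div class="clearfloat">
  <a href="http://example.com/event-1" rel="bookmark">Event 1</a>
  <a href="http://example.com/event-2" rel="bookmark">Event 2</a>
  <a href="http://example.com/about">About</a>
</div>
'''

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that also has a rel attribute."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag == 'a' and 'rel' in attrs and 'href' in attrs:
            self.urls.append(attrs['href'])

collector = LinkCollector()
collector.feed(html)
print(collector.urls)
```

The third link is skipped because it has no `rel` attribute, which mirrors the `rel=True` filter used in the cells above.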

### Very cool resource for learning about CSS selectors: http://flukeout.github.io/

As you go through a web site you should build up a dictionary for the documents that you want to store in Mongo. In the example above we may, for each post url, create a dictionary with the information:
```python
    { 'url': url_of_event,
      'date': date_event,
      'cost': cost_of_event }
```

We can then insert these dictionaries into a Mongo database via PyMongo, which we will learn about next.
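
One possible way to build these dictionaries is sketched below. The helper name `make_event_doc` and the sample values are hypothetical; in practice the fields would come from the pages you scrape.

```python
def make_event_doc(url, date, cost):
    """Bundle the scraped fields for one event into a dictionary."""
    return {'url': url, 'date': date, 'cost': cost}

# Made-up values standing in for fields scraped from event pages
scraped = [
    ('http://example.com/event-1', '2016-06-25', 'FREE'),
    ('http://example.com/event-2', '2016-06-25', '$5'),
]

# One dictionary per event, collected in a list
docs = [make_event_doc(url, date, cost) for url, date, cost in scraped]
print(docs[0])
```

A list of dictionaries like `docs` is the shape that PyMongo's `insert_many` accepts.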

# Scraping from an Existing API

Let's take a look at the API for all the publicly available policing data in the [UK](https://data.police.uk/docs/). After taking a look at the documentation for the interface, let's experiment with what we get when we issue a request to this API. The process looks remarkably similar to the one we went through for scraping a web page, except this time the response we're looking for is available via the `json()` method.

In [None]:
r = requests.get('https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2013-01')
r.json()[:2]

In [None]:
crime_stuff = r.json()
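
Rather than typing the query string by hand, the request URL can be assembled from a dictionary of parameters. The sketch below does this with the standard library's `urlencode`, reusing the parameter values from the request above. The variable names are chosen for illustration.

```python
from urllib.parse import urlencode

base_url = 'https://data.police.uk/api/crimes-street/all-crime'
params = {'lat': '52.629729', 'lng': '-1.131592', 'date': '2013-01'}

# urlencode joins the name=value pairs with '&' and escapes special characters
url = base_url + '?' + urlencode(params)
print(url)

# The requests library can do the same encoding for you:
#     r = requests.get(base_url, params=params)
```

Keeping the parameters in a dictionary makes it easier to change one value, such as the date, inside a loop.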

## API Scraping and Mongo

Many APIs give you a choice of format for the data they return. Choosing JSON will make life easier, since we will frequently be using Mongo for storage during our scraping endeavors, and it plays very well with JSON.
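
A small sketch of why the two fit together: decoding JSON in Python yields lists and dictionaries, and dictionaries are the form in which PyMongo takes documents. The payload below is made up and only loosely shaped like an API response; it is not real data.

```python
import json

# Made-up payload, not real API output
payload = '[{"category": "public-order", "month": "2013-01", "location": {"latitude": "52.6", "longitude": "-1.1"}}]'

# A JSON array becomes a list, a JSON object becomes a dict
records = json.loads(payload)
print(type(records), type(records[0]))

# Nested objects become nested dicts
print(records[0]['location']['latitude'])
```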

Interacting with Mongo from Python is done with PyMongo, the other Mongo client that we talked about earlier. It is designed to have an interface similar to that of the Mongo shell. This ends up being fairly intuitive, since both Python and JavaScript are object-oriented languages and therefore store and refer to things in a similar manner.

In [None]:
from pymongo import MongoClient

client = MongoClient()
db = client.uk_police
collection = db.all_crime

In [None]:
other_request = requests.get('https://data.police.uk/api/crimes-no-location?category=all-crime&force=warwickshire&date=2013-09')

In [None]:
other_request.json()[:2]

In [None]:
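# Possible way to grab data for a range of months and years
url = 'https://data.police.uk/api/crimes-no-location?category=all-crime&force=warwickshire&date={}-{:02d}'
for year in range(2001, 2016):
    for month in range(1, 13):
        # {:02d} zero-pads the month so the date reads YYYY-MM, as in the examples above
        r = requests.get(url.format(year, month))
        # Skip failed requests and empty months; insert_many rejects an empty list
        if r.status_code == 200 and r.json():
            collection.insert_many(r.json())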
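
The example URLs in these notes pass dates as `YYYY-MM` with a zero-padded month. A minimal sketch of generating such strings for a range of years follows; the helper name `month_strings` is hypothetical.

```python
def month_strings(start_year, end_year):
    """Return 'YYYY-MM' strings for every month from start_year through end_year."""
    return ['{}-{:02d}'.format(year, month)
            for year in range(start_year, end_year + 1)
            for month in range(1, 13)]

dates = month_strings(2013, 2014)
print(dates[:3])   # first three months of the range
print(len(dates))  # number of months generated
```

Each string can then be dropped into the `date` parameter of a request.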

In [None]:
collection.insert_many(other_request.json())

In [None]:
import pprint as pp
for item in collection.find({ 'category' : 'public-order' }):
    pp.pprint(item)

In [None]:
# Remember to close the connection
client.close()