# Web Scraping!!!!

(stolen from Cary...)

## Morning Objectives

1. Understand motivation for web scraping:
    * What does a web data pipeline look like?
    * How should we store data from the web?
2. Know high level differences between NoSQL and SQL.
3. Perform basic operations using Mongo Shell.

<div style="text-align: center"><h3>The Reality of Scraping</h3><img src="images/scraping_meme.png" style="width: 600px"></div>

## Why do we scrape the web?

* Realistically, data that you want to study won't always be available to you in the form of a curated data set.
* Need to go to the internets to find interesting data:
    * From an existing company
    * Text for NLP
    * Images
    <div style="text-align: center"><h3>Web Data Pipeline</h3><img src="images/web_data_pipeline.png" style="width: 600px"></div>

## Storing data from the web

* We already know how to store data -> SQL (RDBMS).
    * Why wouldn't SQL necessarily be the best tool for storing data that we retrieve from the web?
        * Data are messy!
* Enter NoSQL. Stands for **N**ot **o**nly **SQL**. MongoDB is a flavor of NoSQL, like PostgreSQL is a flavor of SQL.
    * A NoSQL paradigm may be preferable to SQL because it is schemaless.
    * Great for storing unstructured data, as we may find on the web!
    * MongoDB is a document-oriented DBMS:
      <div style="text-align: center"><h3>Centered around "Documents"</h3><img src="images/document_based_storage.png" style="width: 600px"></div>
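
To make "schemaless" a bit more concrete, here is a minimal sketch in plain Python (no database involved). The two records and their field names are made up for illustration. The point is only that each document carries its own set of fields, whereas a fixed-column table has to pad whatever is missing.

```python
import json

# Two hypothetical scraped records; note they do not share the same fields.
scraped = [
    {"title": "Free yoga", "cost": "FREE", "tags": ["outdoors", "fitness"]},
    {"title": "Corgi meetup", "location": {"city": "San Francisco"}},
]

# Document-style storage: each record is kept as-is (here, as a JSON string).
documents = [json.dumps(record, sort_keys=True) for record in scraped]

# Table-style storage: one fixed set of columns, missing values padded with None.
columns = sorted({key for record in scraped for key in record})
rows = [tuple(record.get(col) for col in columns) for record in scraped]

print(columns)
for row in rows:
    print(row)
```

Every new field that shows up on the web would mean another mostly-empty column in the table version, which is the messiness the bullets above are getting at.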

## SQL vs. Mongo

* SQL - want to prevent redundancy in data by having tables with unique information and relations between them (normalized data).
    * Creates a framework for querying with joins.
    * Makes it easier to update database. Only ever have to change information in a single place.
    * This can result in "simple" queries being slower, but more complex queries are often faster.
* Mongo - document based storage system. Does not enforce normalized data. Can have data redundancies in documents (denormalized data).
    * No joins.
    * A change to database generally results in needing to change many documents.
    * Since there is redundancy in the documents, simple queries are generally faster. But complex queries are often slower.
    

|         | SQL          | Mongo          |
|---------|--------------|----------------|
| Schema  | Yes => Joins | No => No Joins |
| Storage | Table        | Collection     |
|         | Row          | Document       |
|         | Column       | Field          |
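
As a rough illustration of normalized vs. denormalized storage, the sketch below puts the same fact (Frank drives a Civic) into two SQLite tables and into a single document-style dict. SQLite stands in for "SQL" only because it ships with Python; the table and column names are assumptions made for this example.

```python
import sqlite3

# Normalized (SQL): each fact lives in one place; a join reassembles it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cars  (user_id INTEGER REFERENCES users(id), model TEXT);
    INSERT INTO users VALUES (1, 'Frank');
    INSERT INTO cars  VALUES (1, 'Civic');
""")
joined = conn.execute(
    "SELECT users.name, cars.model FROM users JOIN cars ON cars.user_id = users.id"
).fetchall()
conn.close()

# Denormalized (document): the same information embedded in one record.
user_doc = {"name": "Frank", "car": "Civic"}

print(joined)     # rows produced by the join
print(user_doc)   # no join needed; everything is already in the document
```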

## Connecting to Mongo

In practice, there are two main ways that you will be connecting with mongo:

* From Python
* From the console - shell
    
<div style="text-align: center"><img src="images/mongo_clients.png" style="width: 600px"></div>

Both of these clients require a Mongo server to be running. In practice this will require you to start a Mongo Daemon process. To do this execute the command `mongod` at the terminal.

* Note: The Mongo Daemon will need to occupy the terminal that you started it in for the life of the server session. Read: run `mongod` in a separate terminal tab (or tmux).

Now we're going to explore interacting with the mongo shell. This is not the main way that you'll be interacting with mongo while scraping; we have Python for that. But it's good to know how to issue queries from the shell for various reasons.

# Mongo Shell Demo Code

## Using Mongo - General Commands for Inspecting Mongo

```javascript
help                        // List top level mongo commands

db.help()                   // List database level mongo commands

db.<collection name>.help() // List collection level mongo commands.

show dbs                    // Get list of databases on your system

use <database name>         // Change the database that you're currently using

show collections            // Get list of collections within the database that you're currently using
```

## Inserting

Once you're using a database you refer to it with the name **db**. Collections within databases are accessible through dot notation.

```javascript
db.users.insert({ name: 'Jon', age: '45', friends: [ 'Henry', 'Ashley']})

db.getCollectionNames()  // Another way to get the names of collections in current database

db.users.insert({ name: 'Ashley', age: '37', friends: [ 'Jon', 'Henry']})
db.users.insert({ name: 'Frank', age: '17', friends: [ 'Billy'], car : 'Civic'})

db.users.find()

    { "_id" : ObjectId("573a39"), "name" : "Jon", "age" : "45", "friends" : [ "Henry", "Ashley" ] }
    { "_id" : ObjectId("573a3a"), "name" : "Ashley", "age" : "37", "friends" : [ "Jon", "Henry" ] }
    { "_id" : ObjectId("573a3b"), "name" : "Frank", "age" : "17", "friends" : [ "Billy" ], "car" : "Civic" }
```

Things to note:
* The three documents that we inserted into the above database didn't all have the same fields.
* Mongo creates an `_id` field for each document if one isn't provided.

## Querying

```javascript
db.users.find({ name: 'Jon'})                       // find by single field

db.users.find({ car: { $exists : true } })          // find by presence of field

db.users.find({ friends: 'Henry' })                 // find by value in array

db.users.find({}, { name: true })                   // field selection (only return name)
```
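
If the filter syntax feels opaque, it can help to see what each filter is asking for. The sketch below re-implements the four queries above over a plain Python list of dicts. The `matches` and `find` helpers are hypothetical names written for this example; this is not how Mongo works internally and it covers only the operators used above.

```python
# Sample documents mirroring the users inserted earlier (without _id).
users = [
    {"name": "Jon", "age": "45", "friends": ["Henry", "Ashley"]},
    {"name": "Ashley", "age": "37", "friends": ["Jon", "Henry"]},
    {"name": "Frank", "age": "17", "friends": ["Billy"], "car": "Civic"},
]

def matches(doc, query):
    """Return True if doc satisfies every condition in query (illustrative only)."""
    for field, condition in query.items():
        if isinstance(condition, dict) and "$exists" in condition:
            # Presence test: the field must (or must not) be in the document
            if (field in doc) != condition["$exists"]:
                return False
        elif isinstance(doc.get(field), list):
            # A scalar condition matches an array field if the array contains it
            if condition not in doc[field] and condition != doc[field]:
                return False
        elif field not in doc or doc[field] != condition:
            return False
    return True

def find(docs, query=None, fields=None):
    """Filter docs by query, then optionally keep only the named fields."""
    selected = [d for d in docs if matches(d, query or {})]
    if fields:
        selected = [{k: d[k] for k in fields if k in d} for d in selected]
    return selected

print(find(users, {"name": "Jon"}))
print(find(users, {"car": {"$exists": True}}))
print(find(users, {"friends": "Henry"}))
print(find(users, {}, ["name"]))
```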

A quick way to figure out how to write a Mongo query is to think about how you would do it in SQL and check out a resource like this Mongo endorsed [conversion guide](https://docs.mongodb.com/manual/reference/sql-comparison/#create-and-alter), or use something like a [query translator](http://www.querymongo.com/).

## Updating

```javascript
db.users.update({name: "Jon"}, { $set: {friends: ["Phil"]}})            // replaces friends array

db.users.update({name: "Jon"}, { $push: {friends: "Susie"}})            // adds to friends array

db.users.update({name: "Stevie"}, { $push: {friends: "Nicks"}}, true)   // upsert

db.users.update({}, { $set: { activated : false } }, false, true)       // multiple updates
```
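
The difference between `$set`, `$push` and an upsert is easier to see on plain data. Below is a small Python sketch of those semantics acting on a list of dicts. `apply_update` and `update_first` are hypothetical helpers for this example only; they mimic the first three shell commands above and nothing more.

```python
import copy

def apply_update(doc, update):
    """Return a copy of doc with $set / $push operators applied (illustrative only)."""
    result = copy.deepcopy(doc)
    for field, value in update.get("$set", {}).items():
        result[field] = value                       # overwrite the whole field
    for field, value in update.get("$push", {}).items():
        result.setdefault(field, []).append(value)  # append to (or create) the array
    return result

def update_first(docs, query, update, upsert=False):
    """Update the first doc whose fields equal query; optionally insert if none match."""
    for i, doc in enumerate(docs):
        if all(doc.get(k) == v for k, v in query.items()):
            docs[i] = apply_update(doc, update)
            return
    if upsert:
        # No match: start a new document from the query fields, then apply the update
        docs.append(apply_update(dict(query), update))

users = [{"name": "Jon", "friends": ["Henry", "Ashley"]}]
update_first(users, {"name": "Jon"}, {"$set": {"friends": ["Phil"]}})
update_first(users, {"name": "Jon"}, {"$push": {"friends": "Susie"}})
update_first(users, {"name": "Stevie"}, {"$push": {"friends": "Nicks"}}, upsert=True)
print(users)
```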

## Imports and Cursors

To import existing data into a mongo database one uses `mongoimport` at the command line. In this way mongo will accept a number of file formats: JSON, CSV, and TSV.

```
mongoimport --db tweets --collection coffee --file coffee-tweets.json
```

Now that we have some larger data we can see that returns from queries are not always so small.

```javascript
use tweets
db.coffee.find()
```

The shell displays only the first 20 documents returned by a query; after that you will need to type `it` to get more. What `find()` returns is actually a cursor, an object that has many methods implemented on it and supports the command `it` to iterate through more results.
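
For intuition, paging through results 20 at a time looks a lot like pulling fixed-size chunks off a Python iterator. The sketch below uses a generator as a stand-in for the cursor; `batches` is a hypothetical helper and the 122 fake documents simply echo the size of the coffee collection.

```python
from itertools import islice

def batches(cursor, size=20):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(cursor)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Stand-in for a query result with 122 documents
fake_cursor = ({"tweet_number": i} for i in range(122))
batch_sizes = [len(batch) for batch in batches(fake_cursor)]
print(batch_sizes)
```

Each list yielded here plays the role of one screenful in the shell, with the next `it` fetching the following one.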

```javascript
db.coffee.find().count()      // 122

db.coffee.find().limit(2)     // Only two documents

db.coffee.find().sort({ 'user.followers_count' : -1}).limit(3)  // Top three users by followers count
```

## Iteration

MongoDB also has a flexible shell/driver. This allows you to take some action based on a query, or to update documents. You can use an iterator on the cursor to go document by document. In the JavaScript shell we can do this with JavaScript's `forEach`. `forEach` is similar to Python's iteration with the `for` loop; however, JavaScript takes a more functional approach to this type of iteration and requires that you pass a callback (a function), similar to `map` and `reduce`.

```javascript
db.coffee.find().forEach(function(doc) {
    doc.entities.urls.forEach(function(url) {
        db.urls.update({ 'url': url }, { $push: { 'user': doc.user } }, true)
    });
});
```

## Aggregation

Aggregations in Mongo end up being way less pretty than in SQL/Pandas. Let's just bite the bullet and take a look:

```javascript
db.coffee.aggregate( [ { $group :
    {
        _id: "$filter_level",
        count: { $sum: 1 }
    }
}])
```

Here we are first declaring that we're going to do some sort of grouping operation. Then, as Mongo desires everything to have an `_id` field, we specify that the `_id` is going to be the filter level. And then we're going to perform a sum over each level counting 1 for each observation. This information is going to be stored in a field called `count`. What do we get back?
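
If you think in Python, the grouping step is roughly a `collections.Counter` keyed on `filter_level`. The sketch below runs on three made-up documents (not the real coffee tweets), so the numbers it prints say nothing about the actual collection.

```python
from collections import Counter

# Hypothetical stand-ins for tweet documents; only filter_level matters here.
tweets = [
    {"filter_level": "low"},
    {"filter_level": "medium"},
    {"filter_level": "low"},
]

# "$sum: 1" per group is just counting how many documents fall in each group
counts = Counter(tweet.get("filter_level") for tweet in tweets)

# Shape the result like the aggregation output: one document per group
result = [{"_id": level, "count": n} for level, n in counts.items()]
print(result)
```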

We can do more complicated stuff as well. Here's a query that returns the maximum number of friends among the users in this dataset, by country. We need to access the country code field of the place field, but that is easy with an object oriented language like JS.

```javascript
db.coffee.aggregate( [ { $group :
    {
        _id: "$place.country_code",
        maxFriendCount: { $max: "$user.friends_count" }
    }
}])
```
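
The same idea in Python, as a sketch on made-up documents: follow the dotted path into each nested document, then keep a running maximum per country. `get_path` is a hypothetical helper written for this example, and the third tweet has no place on purpose, so it lands under the key `None`.

```python
def get_path(doc, dotted):
    """Follow a dotted path like 'place.country_code'; return None if any step is missing."""
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

# Hypothetical tweet-like documents
tweets = [
    {"place": {"country_code": "US"}, "user": {"friends_count": 120}},
    {"place": {"country_code": "US"}, "user": {"friends_count": 300}},
    {"place": None, "user": {"friends_count": 50}},
]

max_friends = {}
for tweet in tweets:
    country = get_path(tweet, "place.country_code")
    friends = get_path(tweet, "user.friends_count")
    if friends is not None:
        # Running maximum for this country (the key is None when there is no place)
        max_friends[country] = max(friends, max_friends.get(country, friends))
print(max_friends)
```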

For a guide on how to convert from an SQL style aggregation to a Mongo style aggregation, check out this [aggregation conversion guide](https://docs.mongodb.com/manual/reference/sql-aggregation-comparison).

## Afternoon Objectives

1. Understand the process of getting data from the web.
2. Know the basics of HTML/CSS:
    * Know how to pull desired data from web pages.
3. Be able to use existing APIs to fetch pre-formatted data.

### Internet vs. World Wide Web

* The internet is commonly referred to as a network of networks. It is the infrastructure that allows networks all around the world to connect with one another. There are many different protocols to transfer information within this larger, meta-network.
* The World Wide Web, or Web, provides one of the ways that data can be transferred over the internet. Uses a **U**niform **R**esource **L**ocator, URL, to specify the location, within the internet, of a document.

    <div style="text-align: center"><h3>Anatomy of a URL</h3><img src="images/url.png" style="width: 600px"></div>
    
* Documents on the web are generally written in **H**yper**T**ext **M**arkup **L**anguage, HTML, which can be natively viewed by browsers, the tool that we use to browse the web.
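
To see the pieces of a URL programmatically, Python's standard library can split one apart. The sketch below assumes Python 3 (`urllib.parse`) and uses the police API URL that appears later in these notes.

```python
from urllib.parse import urlparse, parse_qs

url = "https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2013-01"
parts = urlparse(url)

print(parts.scheme)           # protocol
print(parts.netloc)           # host
print(parts.path)             # path to the resource on that host
print(parse_qs(parts.query))  # query string as a dict of lists
```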

### Communication on the Web

Information is transmitted around the web through a number of protocols. The main one that you will see is the **H**yper**T**ext **T**ransfer **P**rotocol, HTTP. These transfers, called **requests**, are initiated in a number of ways, but always begin with the client, read: you at your browser.

 <div style="text-align: center"><h3>Requests in Action</h3><img src="images/requests.png" style="width: 600px"></div>
 
There are 4 main types of request that can be issued by your browser: get, post, put and delete. For web scraping purposes, you will almost always be using get requests. We will learn some more about the others in a couple of weeks during data products day.
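
One practical difference between get and post is where the parameters travel. The sketch below builds (but never sends) one request of each kind with Python 3's standard library. The POST variant is there purely for contrast and is not a claim about how this particular API should be called.

```python
from urllib.parse import urlencode
from urllib.request import Request

base = "https://data.police.uk/api/crimes-street/all-crime"
params = {"lat": "52.629729", "lng": "-1.131592", "date": "2013-01"}

# A GET request carries its parameters in the URL's query string
get_request = Request(base + "?" + urlencode(params))

# A POST request carries a body instead (built here but never sent)
post_request = Request(base, data=urlencode(params).encode("utf-8"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```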

# Scraping from a Web Page with Python

Scraping a web site basically comes down to making a request from Python and parsing through the HTML that is returned from each page. For each of these tasks we have a Python library, `requests` and `bs4`, respectively.

### Requests Library

The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making HTTP requests within Python. The interface is mind-bogglingly simple: call the function named after the request type (this will mostly be `get`) with the URL and any optional parameters you'd like passed along with the request. The response object it returns makes the results of the request available via attributes/methods.

In [1]:
import requests
fun_cheap = 'http://sf.funcheap.com'
r = requests.get('http://sf.funcheap.com/2016/06/25/')

In [2]:
r.text[:1000] # First 1000 characters of the HTML

u'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">\r\n\r\n<head profile="http://gmpg.org/xfn/11">\r\n<script src="//cdn.optimizely.com/js/195632799.js"></script>\r\n\r\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\r\n\r\n\r\n<title>Events for June 25, 2016 Archives - FunCheapSF.com</title>\r\n\r\n<meta name="generator" content="WordPress" /> <!-- leave this for stats -->\r\n\r\n<link rel="stylesheet" href="http://cdn.funcheap.com/wp-content/themes/arthemia-premium/style.css?v=1.6" type="text/css" media="screen" />\r\n<link rel="stylesheet" href="http://cdn.funcheap.com/wp-content/themes/arthemia-premium/madmenu.css?v=1.1" type="text/css" media="screen" />\r\n<!--[if IE 6]>\r\n    <style type="text/css">\r\n    body {\r\n        behavior:ur

### Getting Info from a Web Page

Now that we can gain easy access to the HTML for a web page, we need some way to pull the desired content from it. Luckily there is already a system in place to do this. With a combination of HTML and CSS selectors we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

In [3]:
html = '''<!DOCTYPE html>
<html>
<head>
<title>The title of this web page</title>
</head>
<body>
<h1>My Photos</h1>
<div class='intro'>
<p>These are some photos of my trips.</p>
<img src="me.png">
</div>

<h3>Italy</h3>
<div class='country'>
<img src="venice1.png" alt="Venice"> <br />
<img src="venice2.png" alt="Venice"> <br />
<img src="rome.png" alt="Roma">
</div>

<h3>Germany</h3>
<div class='country'>
<img src="berlin.png" alt="Berlin">
</div>
</body>
</html>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

In [4]:
soup.title

<title>The title of this web page</title>

In [5]:
soup.title.string

u'The title of this web page'

In [6]:
soup.h1

<h1>My Photos</h1>

In [7]:
soup.h3

<h3>Italy</h3>

In [8]:
soup.find('h3')

<h3>Italy</h3>

In [9]:
soup.find_all('h3')

[<h3>Italy</h3>, <h3>Germany</h3>]

In [10]:
soup.find_all('h3')[1].string

u'Germany'

In [11]:
soup.find_all('div', class_='country')

[<div class="country">\n<img alt="Venice" src="venice1.png"> <br/>\n<img alt="Venice" src="venice2.png"> <br/>\n<img alt="Roma" src="rome.png">\n</img></img></img></div>,
 <div class="country">\n<img alt="Berlin" src="berlin.png">\n</img></div>]

In [12]:
soup.find_all('img', alt='Venice')

[<img alt="Venice" src="venice1.png"> <br/>\n<img alt="Venice" src="venice2.png"> <br/>\n<img alt="Roma" src="rome.png">\n</img></img></img>,
 <img alt="Venice" src="venice2.png"> <br/>\n<img alt="Roma" src="rome.png">\n</img></img>]

In [13]:
soup.find('div', class_='country').find_previous_siblings('h3')

[<h3>Italy</h3>]

If I wanted to systematically get a list of all of the countries visited, how would I do it?

In [14]:
for div in soup.find_all('div', class_='country'):
    h3 = div.find_previous_siblings('h3')[0]
    country = h3.string
    print(country)

Italy
Germany


In [15]:
for div in soup.find_all('div', class_='country'):
    h3 = div.find_previous_siblings('h3')[0]
    country = h3.string
    for img in div.find_all('img'):
        image = img.get('src')
        print('Country: {}: image: {}'.format(country, image))

Country: Italy: image: venice1.png
Country: Italy: image: venice2.png
Country: Italy: image: rome.png
Country: Germany: image: berlin.png


### Getting Info from a Web Page - Take 2

In [16]:
soup = BeautifulSoup(r.text, 'html.parser')

In [17]:
soup.select('h2.title')[0].string

u'Events for  June 25, 2016'

In [18]:
title = soup.find_all('h2', class_='title')[0]

title

<h2 class="title">Events for  June 25, 2016</h2>

In [19]:
good_clear_float = title.next_sibling.next_sibling

good_clear_float

<div class="clearfloat">\n<span class="left"><a href="/2016/06/24/">&lt; Friday, June 24</a></span>\n<span class="right"><a href="/2016/06/26/">Sunday, June 26 &gt;</a></span>\n<div style="clear:both"></div>\n<div class="archive_date_title" style="background-color:white;margin:0px;margin-bottom:4px;padding:0px;"><h3 style="background-color:black;color:white;margin:0px;padding:3px;font: 16px Arial;font-weight:bold;">Saturday<span style="font-weight:normal;">, June 25</span></h3></div>\n<div class="tanbox left" style="background-color:white;">\n<span class="title"><a href="http://sf.funcheap.com/set-prides-pink-triangle-twin-peaks/" rel="bookmark" title="Giant Pink Triangle on Twin Peaks: Set-Up &amp; Ceremony | Pride 2016">Giant Pink Triangle on Twin Peaks: Set-Up &amp; Ceremony | Pride 2016</a></span>\n<div class="meta archive-meta">Saturday, June 25 \x96 7:00 am | \n\n\n<span class="cost">Cost:</span> <a class="tt">FREE* \n\t\t<div class="tooltip">\n<div class="top"></div>\n<div class

In [20]:
urls = []
for tag in good_clear_float.find_all('a', rel=True):
    href = tag.attrs['href']
    urls.append(href)

In [21]:
urls

[u'http://sf.funcheap.com/set-prides-pink-triangle-twin-peaks/',
 u'http://sf.funcheap.com/set-prides-pink-triangle-twin-peaks/',
 u'http://sf.funcheap.com/1-flip-flop-day-navy/',
 u'http://sf.funcheap.com/1-flip-flop-day-navy/',
 u'http://sf.funcheap.com/summer-free-yoga-series-santana-row/',
 u'http://sf.funcheap.com/summer-free-yoga-series-santana-row/',
 u'http://sf.funcheap.com/pro-beach-volleyball-tournament-avp-san-francisco-open-piers-3032-3/',
 u'http://sf.funcheap.com/pro-beach-volleyball-tournament-avp-san-francisco-open-piers-3032-3/',
 u'http://sf.funcheap.com/environmental-candlestick-point-nature-hike/',
 u'http://sf.funcheap.com/surf-city-classic-woodies-wharf-santa-cruz/',
 u'http://sf.funcheap.com/surf-city-classic-woodies-wharf-santa-cruz/',
 u'http://sf.funcheap.com/dyke-march-playground-brunch-dolores-park/',
 u'http://sf.funcheap.com/nor-cal-corgi-con-summer/',
 u'http://sf.funcheap.com/nor-cal-corgi-con-summer/',
 u'http://sf.funcheap.com/history-san-francisco-pr
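
Notice that most links show up twice in the list above. Before visiting each event page it is probably worth dropping the repeats. Here is one way to do that while keeping the original order, sketched in Python 3 on a handful of the hrefs copied from the output.

```python
# A few of the hrefs collected above, duplicates included
urls = [
    "http://sf.funcheap.com/set-prides-pink-triangle-twin-peaks/",
    "http://sf.funcheap.com/set-prides-pink-triangle-twin-peaks/",
    "http://sf.funcheap.com/1-flip-flop-day-navy/",
    "http://sf.funcheap.com/1-flip-flop-day-navy/",
    "http://sf.funcheap.com/environmental-candlestick-point-nature-hike/",
]

# dict keys keep insertion order (Python 3.7+), so this drops repeats without reordering
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)
```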

### Very cool resource for learning about CSS selectors: http://flukeout.github.io/

As you go through a web site you should build up a dictionary for the documents that you want to store in Mongo. In the example above we may, for each post url, create a dictionary with the information:
```python
    { 'url': url_of_event,
      'date': date_event,
      'cost': cost_of_event }
```

We can then insert these dictionaries into a Mongo database via PyMongo, which we will learn about next.

# Scraping from an Existing API

Let's take a look at the API for all the publicly available policing data in the [UK](https://data.police.uk/docs/). After taking a look at the documentation for the interface, let's experiment with what we get when we issue a request to this API. The process looks remarkably similar to the one we went through for scraping a web page, except this time the response we're looking for is available via the `json()` method.

In [22]:
r = requests.get('https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2013-01')
r.json()[:2]

[{u'category': u'anti-social-behaviour',
  u'context': u'',
  u'id': 20605700,
  u'location': {u'latitude': u'52.624477',
   u'longitude': u'-1.112399',
   u'street': {u'id': 882437, u'name': u'On or near Mundella Street'}},
  u'location_subtype': u'',
  u'location_type': u'Force',
  u'month': u'2013-01',
  u'outcome_status': None,
  u'persistent_id': u''},
 {u'category': u'anti-social-behaviour',
  u'context': u'',
  u'id': 20605689,
  u'location': {u'latitude': u'52.629264',
   u'longitude': u'-1.154764',
   u'street': {u'id': 883534, u'name': u'On or near Westcotes Drive'}},
  u'location_subtype': u'',
  u'location_type': u'Force',
  u'month': u'2013-01',
  u'outcome_status': None,
  u'persistent_id': u''}]

In [23]:
crime_stuff = r.json()

## API Scraping and Mongo

Many APIs will give you a choice of how they return data to you. Choosing JSON will make life easier, since we will frequently be using Mongo for storage during our scraping endeavors, and it plays very well with JSON.
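
Part of why JSON is convenient: once parsed, it is just nested Python lists and dicts, which is the shape of a Mongo document. The sketch below parses one record copied from the response shown earlier, using the standard `json` module instead of a live request.

```python
import json

# One record copied from the response shown above, as raw JSON text
raw = '''[{"category": "anti-social-behaviour", "context": "", "id": 20605700,
  "location": {"latitude": "52.624477", "longitude": "-1.112399",
               "street": {"id": 882437, "name": "On or near Mundella Street"}},
  "location_subtype": "", "location_type": "Force", "month": "2013-01",
  "outcome_status": null, "persistent_id": ""}]'''

crimes = json.loads(raw)          # a list of dicts, the same shape r.json() gives back
first = crimes[0]
print(first["location"]["street"]["name"])
print(first["outcome_status"])    # JSON null becomes Python None
```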

Interacting with Mongo from Python is done with the other Mongo client that we talked about earlier, PyMongo. It is designed to have an interface similar to the Mongo shell's. This ends up being fairly intuitive since both Python and JavaScript are object oriented languages, and therefore store and refer to things in a similar manner.

In [24]:
from pymongo import MongoClient

client = MongoClient()
db = client.uk_police
collection = db.all_crime

In [25]:
other_request = requests.get('https://data.police.uk/api/crimes-no-location?category=all-crime&force=warwickshire&date=2013-09')

In [26]:
other_request.json()[:2]

[{u'category': u'burglary',
  u'context': u'',
  u'id': 26993975,
  u'location': None,
  u'location_subtype': u'',
  u'location_type': None,
  u'month': u'2013-09',
  u'outcome_status': {u'category': u'Investigation complete; no suspect identified',
   u'date': u'2013-10'},
  u'persistent_id': u'601d1a058fb87207bfea500802ad9043fc9629fae479d0a9c3d2abd5b1bbe14d'},
 {u'category': u'burglary',
  u'context': u'',
  u'id': 26994099,
  u'location': None,
  u'location_subtype': u'',
  u'location_type': None,
  u'month': u'2013-09',
  u'outcome_status': {u'category': u'Investigation complete; no suspect identified',
   u'date': u'2013-10'},
  u'persistent_id': u'34990376c7cb84ede03c06f3dda76cda7fd63cdaa9f2179213d9aa2d53b9d9e6'}]

In [27]:
# Possible way to grab data for range of months and years
for year in range(2013, 2014):
    for month in range(1, 13):
        print('Scraping year/month: {}/{}'.format(year, month))
        # Zero-pad the month so the date has the same YYYY-MM shape as the requests above
        r = requests.get('https://data.police.uk/api/crimes-no-location?category=all-crime&force=warwickshire&date={}-{:02d}'.format(year, month))
        collection.insert_many(r.json())

Scraping year/month: 2013/1
Scraping year/month: 2013/2
Scraping year/month: 2013/3
Scraping year/month: 2013/4
Scraping year/month: 2013/5
Scraping year/month: 2013/6
Scraping year/month: 2013/7
Scraping year/month: 2013/8
Scraping year/month: 2013/9
Scraping year/month: 2013/10
Scraping year/month: 2013/11
Scraping year/month: 2013/12
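
When looping over many months it can be handy to build and eyeball the URLs before firing off any requests. The sketch below (Python 3, nothing is sent) does that for an inclusive range of years. `month_urls` is a hypothetical helper, and zero-padding the month to `YYYY-MM` is an assumption based on the date format used in the earlier requests.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.police.uk/api/crimes-no-location"

def month_urls(start_year, end_year, force="warwickshire"):
    """Build one request URL per month for the inclusive year range (nothing is sent)."""
    urls = []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            params = {
                "category": "all-crime",
                "force": force,
                "date": "{}-{:02d}".format(year, month),  # zero-padded, e.g. 2013-01
            }
            urls.append(BASE_URL + "?" + urlencode(params))
    return urls

urls_2013 = month_urls(2013, 2013)
print(len(urls_2013))
print(urls_2013[0])
```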


In [28]:
collection.insert_many(other_request.json())

<pymongo.results.InsertManyResult at 0x107c4f280>

In [29]:
import pprint as pp
for item in collection.find({ 'category' : 'public-order' }):
    pp.pprint(item)

{u'_id': ObjectId('589e0e160310e901de42e783'),
 u'category': u'public-order',
 u'context': u'',
 u'id': 24587102,
 u'location': None,
 u'location_subtype': u'',
 u'location_type': None,
 u'month': u'2013-06',
 u'outcome_status': {u'category': u'Court result unavailable',
                     u'date': u'2014-04'},
 u'persistent_id': u'0842db58ea7e0298d7528aa18d103f3b34e249435ba4c1c7feffd65da5f4b8d9'}
{u'_id': ObjectId('589e0e170310e901de42e78b'),
 u'category': u'public-order',
 u'context': u'',
 u'id': 25758751,
 u'location': None,
 u'location_subtype': u'',
 u'location_type': None,
 u'month': u'2013-07',
 u'outcome_status': {u'category': u'Under investigation', u'date': u'2013-07'},
 u'persistent_id': u'a7d07d4cd6ba0f4a886ba12d0ff48a43b43039535eaa6a5b13569f1936cda98e'}
{u'_id': ObjectId('589e0e190310e901de42e7a0'),
 u'category': u'public-order',
 u'context': u'',
 u'id': 26992793,
 u'location': None,
 u'location_subtype': u'',
 u'location_type': None,
 u'month': u'2013-09',
 u'outcome_

In [30]:
# Remember to close the connection
client.close()