# Web Scraping and HTML Concepts


### Morning Objectives:
* Install `mongo` and `pymongo`.
* *Compare and contrast* SQL and NoSQL.
* *Perform* basic operations using Mongo.

### Afternoon Objectives:
* *Describe* a typical web scraping data pipeline.
* *Explain* the basic concepts of HTML.
* *Learn how to* write code to pull elements from a web page using BeautifulSoup.
* *Use* an existing API to fetch data and parse using BeautifulSoup.

## Resources

* [Precourse-Web Awareness](https://github.com/zipfian/precourse/tree/master/Chapter_8_Web_Awareness)
* [The Little MongoDB Book](http://openmymind.net/mongodb.pdf)
* [w3 schools](http://www.w3schools.com/): HTML tags and their attributes.
* [PyMongo tutorial](http://api.mongodb.org/python/current/tutorial.html)
* [BeautifulSoup Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Scrape anonymously with Tor](https://deshmukhsuraj.wordpress.com/2015/03/08/anonymous-web-scraping-using-python-and-tor/)

## Installing Mongo and PyMongo

### Mongo
1. Install MongoDB: `brew install mongodb`
2. Start MongoDB: `brew services start mongodb`

#### Do *not* run services as `root`.  Ever.  Even if someone tells you to.

### PyMongo
1. Install PyMongo: `conda install pymongo`

### SQL vs NoSQL

NoSQL does not stand for 'No SQL'. SQL is useful for many things, and it's not going away.

> NoSQL ==> "Not Only SQL"

It's a different paradigm for dealing with messy data that does not lend itself to an RDBMS.  It's also very useful as a quick and painless solution to data storage, whereas a full relational database model takes much thought and investment.


| Topic | SQL | NoSQL |
| --- | --- | --- |
| Database | Database (Schema for Oracle) | Database |
| Table | Table | Collection |
| Column | Columns | Document keys |

**Note:** Documents in the same NoSQL collection are not required to share the same keys -- whaaaaat?!

Today we're talking about Mongo, but here is a high level overview of the Not Only SQL space:

<img src="images/noSQL.png" width = 500>

### Mongo Clients

The command line program we use to interact with Mongo is a *client*.  Its only job is to send messages to another program, a *server*, which holds all our data and knows how to operate on it.

The command line Mongo client is written in JavaScript, so interacting with Mongo through this client looks like writing JavaScript code.

<img src="images/client-server.png" width = 500>

There are other clients.  Later on we will use `pymongo` to interact with our databases from Python.

## JavaScript Object Notation

JavaScript Object Notation, or JSON, is a simple data storage and communication format.  It was designed by [Douglas Crockford](https://en.wikipedia.org/wiki/Douglas_Crockford) based on the notation JavaScript uses for objects.

It is commonly used as a lighter-weight alternative to XML.

```javascript
{
    name: 'TwilightSparkle',
    friends: ['Applejack', 'Fluttershy'],
    age: 16,
    gender: 'f',
    wings: true,
    horn: true,
    residence: {
        town: 'Ponyville',
        address: '15 Gandolfini Lane'}
}
```

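It is very similar to a Python dictionary literal.  Note that strict JSON requires *double quotes* around keys and strings; the unquoted keys and single-quoted strings above are the relaxed JavaScript object syntax that the Mongo shell accepts.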
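As a quick check of how JSON quoting behaves, Python's standard `json` module can serialize a dictionary to JSON text and parse it back.  This is only a minimal sketch; the variable names are illustrative.

```python
import json

# A Python dict mirroring part of the document shown above.
pony = {
    "name": "TwilightSparkle",
    "friends": ["Applejack", "Fluttershy"],
    "age": 16,
    "wings": True,
    "residence": {"town": "Ponyville", "address": "15 Gandolfini Lane"},
}

# Serialize to JSON text: keys and strings come out double-quoted,
# and Python's True becomes JSON's true.
text = json.dumps(pony)
print(text)

# Parse the JSON text back into an equal Python dict.
restored = json.loads(text)
print(restored == pony)
```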

## Working with Mongo DB

### MongoDB Concepts

#### What's it about? 

* MongoDB is a document-oriented database, an alternative to RDBMS, used for storing semi-structured data.
* JSON-like objects form the data model, rather than RDBMS tables.
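* No enforced schema.  Classic MongoDB also had no joins and no multi-document transactions (newer versions add `$lookup` and multi-document transactions).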
* Sub-optimal for complicated queries.

#### Structure of the database.

* MongoDB is made up of databases which contain collections (analogous to tables).
* A collection is made up of documents (analogous to rows or records).
* Each document is a JSON-like object made up of key-value pairs (the keys are analogous to columns).


So an RDBMS defines columns at the table level, while a document-oriented database defines its fields at the document level.

I created a `unicorns.json` file that can be imported into MongoDB.

```
mongoimport --db unicorns --collection unicorns < unicorns.json
```

**Note**: If you are using Linux, you may need to add the switch `--jsonArray`.

Now start mongo.

A MongoDB server holds a collection of databases, so let's check that the `unicorns` database exists.

```
> show dbs
```

To use the `unicorns` database, we simply do the following:

```
> use unicorns
```

A database is made of `collection`s, which are containers for the actual stored data.  A `collection` is analogous to a `table` in a classical relational database, but can contain much more flexible data than a table.

```
> db.getCollectionNames()
```

### Inserting Data
```javascript
db.unicorns.insert({
    name: 'Applejack',
    age: 15,
    friends: ['TwilightSparkle', 'Fluttershy'],
    wings: false,
    horn: false
})

db.unicorns.insert({
    name: 'Fluttershy',
    age: 15,
    friends: ['Applejack', 'TwilightSparkle'],
    wings: true,
    horn: false
})
                 
```

## Querying Data

Without any arguments, `find` returns every document in the collection, like a SQL `SELECT *` with no `WHERE` clause.  Passing a selector to `find` plays the role of the `WHERE` clause.

```javascript
db.unicorns.find()
```

`find` is much more flexible.

```javascript
// find by single field
db.unicorns.find({name: 'TwilightSparkle'})

// find by presence of field
db.unicorns.find({friends: {$exists : true}})

// find by value in array
db.unicorns.find({friends: 'TwilightSparkle'})

// To return only certain fields
// This says, return only the names of unicorns who are friends with
// twilight sparkle.
db.unicorns.find({friends: 'TwilightSparkle'}, {name: true})
```

Above, the first object is called a *selector* (it selects which documents match) and the second is a *projection* (it chooses which fields to return).

Listing several criteria in one selector is like using `AND` in SQL.  To express `OR`, use the `$or` operator:

```javascript
db.unicorns.find({gender: 'f', $or: [{loves: 'apple'}, {weight: {$lt: 500}}]})
```

**Exercises**: 

<details><summary>
Q: Find all the unicorns with wings.
</summary>
db.unicorns.find({'wings':true})
</details>

<details><summary>
Q: Find only the friends of unicorns with wings
</summary>
db.unicorns.find({'wings':true}, {'friends':true, _id:false})
</details>

<details><summary>
Q: Of the unicorns with wings, return only those with friends (no blank arrays)
</summary>
db.unicorns.find({'wings':true, friends:{$ne:[]}}, {'friends':true, name:true, _id:false})
</details>


## Updating Data

```javascript
// Replaces friends array
db.unicorns.update({
    name: 'TwilightSparkle'}, {
    $set: {
        friends: ['Shutterfly', 'Rarity', 'Applejack']}})

// Adds to friends array
db.unicorns.update({
    name: 'Applejack'}, {
    $push: {
        friends: 'Rarity'}})
```

We have to use the `$set` and `$push` operators because the default behaviour of `update` is to *replace the data*.

```javascript
// Replaces the TwilightSparkle data completely!
// It will no longer have even a name field after this!
db.unicorns.update({
    name: 'TwilightSparkle'}, {
    friends: ['Shutterfly', 'Rarity', 'Applejack']})
```


An `upsert` either creates a document (when no document matches the selector) or updates the existing matching document.

```javascript
// Upsert: This one is created
db.unicorns.update({
    name: "Rarity"}, {
    $push: {
        friends: {
            $each: ["TwilightSparkle", "Applejack", "Fluttershy"]}}}, {
    upsert: true})

// Upsert: This one is updated
db.unicorns.update({
    name: "Fluttershy"}, {
    $push: {
        friends: {
            $each: ["Rarity", "PrincessCelestia"]}}}, {
    upsert: true})
```
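To make the difference between replacing, `$set`, `$push`, and an upsert concrete, here is a small in-memory model written in plain Python.  It is a simplified sketch of the behaviour described above, not how MongoDB is implemented; the function names `apply_update` and `update_one` are hypothetical, and only equality selectors are handled.

```python
import copy


def apply_update(doc, update):
    """Return the document that results from applying `update` to `doc`."""
    operators = [key for key in update if key.startswith("$")]
    if not operators:
        # No operators: the whole document is replaced (only _id survives).
        replaced = copy.deepcopy(update)
        if "_id" in doc:
            replaced["_id"] = doc["_id"]
        return replaced
    result = copy.deepcopy(doc)
    for field, value in update.get("$set", {}).items():
        result[field] = value
    for field, value in update.get("$push", {}).items():
        # {"$each": [...]} pushes several items; anything else is one item.
        if isinstance(value, dict) and "$each" in value:
            items = value["$each"]
        else:
            items = [value]
        result.setdefault(field, []).extend(items)
    return result


def update_one(collection, selector, update, upsert=False):
    """Update the first document whose fields equal the selector's fields."""
    for index, doc in enumerate(collection):
        if all(doc.get(key) == value for key, value in selector.items()):
            collection[index] = apply_update(doc, update)
            return
    if upsert:
        # No match: start from the selector's fields, then apply the update.
        collection.append(apply_update(dict(selector), update))


unicorns = [{"name": "Applejack", "age": 15, "friends": ["Fluttershy"]}]

# $push appends to the existing array.
update_one(unicorns, {"name": "Applejack"}, {"$push": {"friends": "Rarity"}})
pushed = copy.deepcopy(unicorns[0])

# Upsert: nothing matches "Rarity", so a new document is created.
update_one(unicorns, {"name": "Rarity"},
           {"$push": {"friends": {"$each": ["Applejack"]}}}, upsert=True)

# No operator: Applejack's document is replaced and loses its name and age.
update_one(unicorns, {"name": "Applejack"}, {"friends": []})
print(unicorns)
```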

**Exercise**: Enter a unicorn named RainbowDash into the database that is friends with TwilightSparkle, Rarity, and Applejack.

## Deleting Data

*Don't run this one!*

```javascript
db.unicorns.remove({})
```

## PyMongo


`pymongo` allows Python to connect to and manipulate MongoDB.

In [1]:
from pymongo import MongoClient
import pprint

In [2]:
# Connect to the MongoDB instance running locally
client = MongoClient('mongodb://localhost:27017/')

In [3]:
db = client.unicorns

In [4]:
# Get the collection called unicorns (created lazily if it does not exist)
unicorns = db.unicorns

In [5]:
unicorns.insert_one({
    'name': 'RainbowDash', 
    'age': 16, 
    'friends': ['TwilightSparkle', 'Applejack', 'Rarity']})

<pymongo.results.InsertOneResult at 0x116078d20>

In [8]:
# count_documents replaces the older find().count(), which PyMongo 4 removed
unicorns.count_documents({})

9

In [9]:
print(unicorns.find_one())

{u'name': u'TwilightSparkle', u'gender': u'f', u'age': 16, u'horn': True, u'residence': {u'town': u'Ponyville', u'address': u'15 Gandolfini Lane'}, u'_id': ObjectId('598cf4506ac6234d87c88cf1'), u'friends': [u'Applejack', u'Fluttershy'], u'wings': True}


In [10]:
rarity = unicorns.find_one({'name': 'Rarity'})
pprint.pprint(rarity)

{u'_id': ObjectId('5992063c4d7fa4644596e950'),
 u'friends': [u'TwilightSparkle', u'Applejack', u'Fluttershy'],
 u'name': u'Rarity'}


The same selector strategies can be used for more complex queries in `pymongo`

In [11]:
friend_of_twilight = unicorns.find_one({'friends': 'TwilightSparkle'})
pprint.pprint(friend_of_twilight)

{u'_id': ObjectId('598cf4506ac6234d87c88cf2'),
 u'age': 34,
 u'friends': [u'TwilightSparkle'],
 u'horn': True,
 u'name': u'PrincessCelestia',
 u'wings': True}


To get multiple results back, use `find`, which returns an iterator.

In [12]:
friends_of_twilight = unicorns.find({'friends': 'TwilightSparkle'})
for friend in friends_of_twilight:
    pprint.pprint(friend)

{u'_id': ObjectId('598cf4506ac6234d87c88cf2'),
 u'age': 34,
 u'friends': [u'TwilightSparkle'],
 u'horn': True,
 u'name': u'PrincessCelestia',
 u'wings': True}
{u'_id': ObjectId('598cf4b89f41cd237793876a'),
 u'age': 15.0,
 u'friends': [u'TwilightSparkle', u'Fluttershy'],
 u'horn': False,
 u'name': u'Applejack',
 u'wings': False}
{u'_id': ObjectId('598cf4c19f41cd237793876b'),
 u'age': 15.0,
 u'friends': [u'Applejack', u'TwilightSparkle'],
 u'horn': False,
 u'name': u'Fluttershy',
 u'wings': True}
{u'_id': ObjectId('598e1eb0c64e701ec1c26625'),
 u'age': 16,
 u'friends': [u'TwilightSparkle', u'Applejack', u'Rarity'],
 u'name': u'RainbowDash'}
{u'_id': ObjectId('5992063c4d7fa4644596e950'),
 u'friends': [u'TwilightSparkle', u'Applejack', u'Fluttershy'],
 u'name': u'Rarity'}
{u'_id': ObjectId('5a20358ddbc88be0fef1783c'),
 u'age': 15.0,
 u'friends': [u'TwilightSparkle', u'Fluttershy'],
 u'horn': False,
 u'name': u'Applejack',
 u'wings': False}
{u'_id': ObjectId('5a2038e5c64e7042ef8fd352'),
 u'a

In [13]:
young_unicorns = unicorns.find({'age': {'$lt': 16}})
for unicorn in young_unicorns[:2]:
    pprint.pprint(unicorn)

{u'_id': ObjectId('598cf4b89f41cd237793876a'),
 u'age': 15.0,
 u'friends': [u'TwilightSparkle', u'Fluttershy'],
 u'horn': False,
 u'name': u'Applejack',
 u'wings': False}
{u'_id': ObjectId('598cf4c19f41cd237793876b'),
 u'age': 15.0,
 u'friends': [u'Applejack', u'TwilightSparkle'],
 u'horn': False,
 u'name': u'Fluttershy',
 u'wings': True}
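To make the selector semantics used above concrete, here is a sketch in plain Python of how a selector decides whether a document matches.  It is an illustration only: the `matches` function is hypothetical, it covers just the forms seen in these notes (plain values, array membership, `$lt`, `$exists`, and `$or`), and it is not how MongoDB evaluates queries internally.

```python
def matches(doc, selector):
    """Return True if `doc` satisfies every condition in `selector`."""
    for key, condition in selector.items():
        if key == "$or":
            # $or takes a list of selectors; at least one must match.
            if not any(matches(doc, alternative) for alternative in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"age": {"$lt": 16}}.
            for operator, operand in condition.items():
                if operator == "$lt":
                    if key not in doc or not doc[key] < operand:
                        return False
                elif operator == "$exists":
                    if (key in doc) != operand:
                        return False
                else:
                    raise ValueError("operator not covered by this sketch: " + operator)
        else:
            value = doc.get(key)
            # A plain value matches an equal field or a member of an array field.
            if not (value == condition or (isinstance(value, list) and condition in value)):
                return False
    return True


unicorns = [
    {"name": "Applejack", "age": 15, "friends": ["TwilightSparkle", "Fluttershy"]},
    {"name": "PrincessCelestia", "age": 34, "friends": ["TwilightSparkle"]},
    {"name": "Nightmare Moon", "age": 30},
]

# Several keys in one selector behave like AND.
young_friends = [u["name"] for u in unicorns
                 if matches(u, {"friends": "TwilightSparkle", "age": {"$lt": 16}})]
print(young_friends)
```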


**Exercise:** Find all the unicorns that have a horn and wings.

In [None]:
unicorns.find({})

In [None]:
young_unicorns = unicorns.find({'age': {'$lt': 16}})

### Aggregations

Aggregations are not something you will do a ton of while working in Mongo, but here is an example.

In [14]:
popular_unicorns = unicorns.aggregate([
    # Keep the name and compute the length of the friends array.
    {'$project': {
        'name': 1,
        'numberOfFriends': {'$size': '$friends'},
    }},
])

In [15]:
for row in popular_unicorns:
    print(row)

{u'numberOfFriends': 2, u'_id': ObjectId('598cf4506ac6234d87c88cf1'), u'name': u'TwilightSparkle'}
{u'numberOfFriends': 1, u'_id': ObjectId('598cf4506ac6234d87c88cf2'), u'name': u'PrincessCelestia'}
{u'numberOfFriends': 0, u'_id': ObjectId('598cf4506ac6234d87c88cf3'), u'name': u'Nightmare Moon'}
{u'numberOfFriends': 2, u'_id': ObjectId('598cf4b89f41cd237793876a'), u'name': u'Applejack'}
{u'numberOfFriends': 2, u'_id': ObjectId('598cf4c19f41cd237793876b'), u'name': u'Fluttershy'}
{u'numberOfFriends': 3, u'_id': ObjectId('598e1eb0c64e701ec1c26625'), u'name': u'RainbowDash'}
{u'numberOfFriends': 3, u'_id': ObjectId('5992063c4d7fa4644596e950'), u'name': u'Rarity'}
{u'numberOfFriends': 2, u'_id': ObjectId('5a20358ddbc88be0fef1783c'), u'name': u'Applejack'}
{u'numberOfFriends': 3, u'_id': ObjectId('5a2038e5c64e7042ef8fd352'), u'name': u'RainbowDash'}


# Afternoon: Web Scraping using requests and BeautifulSoup

## HTML Concepts

**H**yper**T**ext **M**arkup **L**anguage

A *markup language* (think markdown) that forms the building blocks of all websites.  Controls what to say and where to say it, along with some semantic meaning (this is a section, this is a list, this part is emphasised).

Consists of tags enclosed in angle brackets (like `<html>`)

A minimal HTML document, unfortunately, contains a lot of cruft.  Here's one I got from [https://www.sitepoint.com/a-minimal-html-document/](https://www.sitepoint.com/a-minimal-html-document/).


```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
  <head>
  
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title>title</title>
    <link rel="stylesheet" type="text/css" href="style.css">
    <script type="text/javascript" src="script.js"></script>
  </head>
  <body>
		
  </body>
</html>
```

The `<link>` and `<script>` tags are not strictly necessary, but will appear in more or less every HTML document.

* The `<link>` tag points to a **stylesheet**, which controls how different parts of the document are rendered in the browser.  This makes things pretty.
* The `<script>` tag points to a **JavaScript** program.  This allows programmers to add *dynamic behaviour* to an HTML document.
* The `<body>` tag contains the guts of your document.

### Important Tags

```html
<div>Defines a division or section of the document.</div>
<a href="http://www.w3schools.com">A Hyperlink to W3Schools.com!</a>

<h1>This is a header!</h1>

<p>This is a paragraph!</p>

<h2>This is a Subheading!</h2>

<table>
  This is a table!
  <tr>
    <td>An entry in the first row.</td>
    
    <td>Another entry in the first row.</td>
  </tr>
  <tr>
    <td>An entry in the second row.</td>
    <td>Another entry in the second row.</td>
  </tr>
</table>

<ul>
  This is a list!
  <li>This is the first thing in the list!</li>
  <li>This is the second thing in the list!</li>
</ul>
```

I saved the HTML document described above as `basic.html`.
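As a first taste of pulling elements out of markup like this, here is a minimal BeautifulSoup sketch.  It assumes the `bs4` package is installed and parses a short inline HTML string (an illustrative stand-in, not the contents of `basic.html`).

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>This is a header!</h1>
  <a href="http://www.w3schools.com">A hyperlink to W3Schools.com!</a>
  <ul>
    <li>This is the first thing in the list!</li>
    <li>This is the second thing in the list!</li>
  </ul>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

header_text = soup.find("h1").text                     # text inside the first <h1>
link_target = soup.find("a")["href"]                   # value of the href attribute
list_items = [li.text for li in soup.find_all("li")]   # text of every <li>

print(header_text)
print(link_target)
print(list_items)
```

`find` returns the first matching tag, while `find_all` returns every match.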

## Web vs Internet

The web, or WWW (World Wide Web), is different from the Internet.  You can think of the web as a collection of islands and the Internet as the bridges connecting the islands.

I.e. the web is a collection of *content* and the Internet is the *infrastructure for accessing and distributing* this content.

HTTP is the language of the web.  Content on the web is requested and delivered over HTTP, most often as HTML documents.

## HTTP Requests

To get data from the web, you need to make an HTTP request.  The two most important request types are:

* GET (queries data; no request body is *sent*)
* POST (creates or updates data; *data is sent* in the request body)

`curl` is a command line program for sending HTTP requests.  It's easy to send a `GET` request to a url.

In [None]:
!curl http://madrury.github.io
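The difference between the two request types can also be seen from Python.  The sketch below only *builds* the requests with the standard library's `urllib` (nothing is sent over the network); the `example.com` URLs and the form fields are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: everything the server needs is in the URL itself.
get_request = Request("http://example.com/search?" + urlencode({"q": "unicorns"}))

# POST: the data travels in the request body, encoded as bytes.
body = urlencode({"name": "RainbowDash", "age": 16}).encode("utf-8")
post_request = Request("http://example.com/unicorns", data=body)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

`urllib` is used here only because it ships with Python; the `requests` library used in the scraping example plays the same role with a friendlier interface.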

## Scraping

Web scraping is the process of programmatically getting data from the web.

<img src="images/pipeline.png" width = 500>

### Example: Load table into a data frame.

Let's load the Super Metroid speedrun leaderboards at [Deer Tier](http://deertier.com/Leaderboard/AnyPercentRealTime) into a Mongo database, and then load this database into a pandas data frame.

In [22]:
import warnings
warnings.filterwarnings('ignore')

In [23]:
import copy
import pandas as pd

# Requests sends HTTP requests and receives the responses.
import requests

# Beautiful Soup parses HTML documents in python.
from bs4 import BeautifulSoup

#### Step 1: Check out the website in a browser.

The first step is to check out the website in a browser.

Open the `Developer Tools` to get a useful display of the hypertext we will be working with.

The table we need is inside a `<div>` with `class=scoreTable`.  Looking closely, the structure is like this:

```
<div class=scoreTable>
  <table>
    <tr>..</tr>
    ...
    <tr>...</tr>
  </table>
</div>
```

Each row has a `title` attribute that contains some interesting data:

```
<tr title="Submitted by Oatsngoats on: 19/10/2016">
```

Inside each row, the columns have the following data:

```
rank, player, time, video url, comment
```

This should be enough information for us to get to scraping.

#### Step 2: Send a GET request for the data.

In [24]:
deer_tier_url = 'http://deertier.com/Leaderboard/AnyPercentRealTime'
r = requests.get(deer_tier_url)

A status code of `200` means that everything went well.

In [25]:
r.status_code

200
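Rather than eyeballing the status code, a scraper can fail loudly when a request goes wrong.  The helper below is a sketch: `fetch_html` is a hypothetical name, and it takes the HTTP function as a parameter (expected to behave like `requests.get`) so that it can be exercised without network access.

```python
def fetch_html(url, http_get, timeout=10):
    """Fetch `url` and return the raw body, raising if the request failed.

    `http_get` is expected to behave like `requests.get`.
    """
    response = http_get(url, timeout=timeout)
    # raise_for_status() raises an exception for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.content

# Usage (requires network access):
#   import requests
#   html = fetch_html('http://deertier.com/Leaderboard/AnyPercentRealTime', requests.get)
```

Passing a `timeout` keeps a stalled server from hanging the script indefinitely.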

We can check out the raw hypertext in the `content` attribute of the response.

In [26]:
r.content

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<head runat="server">\r\n    \r\n    <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png" />\r\n    <link rel="icon" type="image/png" href="/favicon-32x32.png" sizes="32x32" />\r\n    <link rel="icon" type="image/png" href="/favicon-16x16.png" sizes="16x16" />\r\n    <link rel="manifest" href="/manifest.json" />\r\n    <link rel="mask-icon" href="/safari-pinned-tab.svg" color="#fec200" />\r\n    <meta name="apple-mobile-web-app-title" content="Deer Tier" />\r\n    <meta name="application-name" content="Deer Tier" />\r\n    <meta name="theme-color" content="#331f52" />\r\n    <meta name="viewport" content="width=device-width" />\r\n\r\n    <link href="/CSS/Reset.css" rel="Stylesheet" type="text/css" />\r\n    <link href="/CSS/font-awesome.min.css" rel="Stylesheet" type="text/css"  />\r\n    <

#### Step 3: Save all the hypertext into mongo for later use.

In [27]:
client = MongoClient('mongodb://localhost:27017/')
db = client.metroid
pages = db.pages

pages.insert_one({'html': r.content})

<pymongo.results.InsertOneResult at 0x11721f870>

#### Step 4: Parse the hypertext with BeautifulSoup.

This is the beautiful part of the soup.  Parsing the HTML into a python object is effortless.

In [28]:
soup = BeautifulSoup(r.content, "lxml")

In [29]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head runat="server">
  <link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
  <link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
  <link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
  <link href="/manifest.json" rel="manifest"/>
  <link color="#fec200" href="/safari-pinned-tab.svg" rel="mask-icon"/>
  <meta content="Deer Tier" name="apple-mobile-web-app-title"/>
  <meta content="Deer Tier" name="application-name"/>
  <meta content="#331f52" name="theme-color"/>
  <meta content="width=device-width" name="viewport"/>
  <link href="/CSS/Reset.css" rel="Stylesheet" type="text/css"/>
  <link href="/CSS/font-awesome.min.css" rel="Stylesheet" type="text/css"/>
  <link href="/CSS/Styles.css?v=21" rel="Stylesheet" type="text/css"/>
  <script src="/Scripts/jq

In [30]:
print(soup.title)

<title>Any% Real Time - Deer Tier</title>


#### Step 5: Navigate the data to pull out the table information.

Recall the structure of the table we are looking for:

```
<div class=scoreTable>
  <table>
    <tr>..</tr>
    ...
    <tr>...</tr>
  </table>
</div>
```

In [32]:
div = soup.find("div", {"class": "scoreTable"})
table = div.find("table")

# This returns a list of the rows in the table.
rows = table.find_all("tr")

all_rows = []

# Let's store each row as a dictionary 
empty_row = {
    "rank": None, "player": None, "time": None, "comment": None
}

# The first row contains header information, so we are skipping it.
for row in rows[1:]:
    new_row = copy.copy(empty_row)
    # A list of all the entries in the row.
    columns = row.find_all("td")
    new_row['rank'] = int(columns[0].text.strip())
    new_row['player'] = columns[1].text.strip()
    new_row['time'] = columns[2].text.strip()
    new_row['comment'] = columns[4].text.strip()
    all_rows.append(new_row)    

In [33]:
pprint.pprint(all_rows[:4])

[{'comment': u'Just breathe',
  'player': u'Behemoth',
  'rank': 1,
  'time': u'41:33'},
 {'comment': u'guitar% WR', 'player': u'zoast', 'rank': 2, 'time': u'41:42'},
 {'comment': u'Mole%', 'player': u'Oatsngoats', 'rank': 3, 'time': u'41:56'},
 {'comment': u'Goofy quarter halfie.',
  'player': u'kottpower',
  'rank': 4,
  'time': u'42:20'}]
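Step 1 noted that each row's `title` attribute also carries the submitter and the submission date, which the loop above does not use.  Here is a sketch of parsing such a string with the standard library; `parse_row_title` is a hypothetical helper, and the day/month/year order is an assumption based on the single sample `19/10/2016`.

```python
import re
from datetime import datetime

TITLE_PATTERN = re.compile(
    r"Submitted by (?P<player>.+) on: (?P<date>\d{1,2}/\d{1,2}/\d{4})"
)


def parse_row_title(title):
    """Split a row title into (submitter, submission date), or None if it does not match."""
    match = TITLE_PATTERN.fullmatch(title.strip())
    if match is None:
        return None
    # Assumes day/month/year, as in the sample "19/10/2016".
    submitted_on = datetime.strptime(match.group("date"), "%d/%m/%Y").date()
    return match.group("player"), submitted_on


print(parse_row_title("Submitted by Oatsngoats on: 19/10/2016"))
```

Inside the scraping loop, `row.get('title')` would supply the string to parse.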


#### Step 6: Load all the rows into a Mongo database.

Since we collected all the rows into python dictionaries, this is easy.

In [34]:
db = client.metroid

In [35]:
deer_tier = db.deer_tier

In [36]:
for row in all_rows:
    deer_tier.insert_one(row)

Now we can check from the command line that the data is really in there!

#### Step 7: Load all the rows into a pandas dataframe.

Even though there is no real reason to, let's load all the rows from the Mongo database just to give a more thorough example of how you can go about things.

In [37]:
rows = deer_tier.find()
super_metroid_times = pd.DataFrame(list(rows))

In [38]:
super_metroid_times.head()

| | _id | comment | player | rank | time |
| --- | --- | --- | --- | --- | --- |
| 0 | 5a206ab0c64e7042ef8fd358 | Just breathe | Behemoth | 1 | 41:33 |
| 1 | 5a206ab0c64e7042ef8fd359 | guitar% WR | zoast | 2 | 41:42 |
| 2 | 5a206ab0c64e7042ef8fd35a | Mole% | Oatsngoats | 3 | 41:56 |
| 3 | 5a206ab0c64e7042ef8fd35b | Goofy quarter halfie. | kottpower | 4 | 42:20 |
| 4 | 5a206ab0c64e7042ef8fd35c | choke artist | Twocat | 5 | 42:29 |


In [None]:
super_metroid_times = super_metroid_times.drop("_id", axis=1)
super_metroid_times = super_metroid_times.set_index("rank")
super_metroid_times.head()

Goal Achieved!
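One natural follow-up: the `time` column still holds strings such as `41:33`, which cannot be averaged or sorted numerically.  A small sketch of a converter, assuming the times are formatted as `MM:SS` or `H:MM:SS` (the function name `time_to_seconds` is made up for this example):

```python
def time_to_seconds(time_string):
    """Convert 'MM:SS' or 'H:MM:SS' into a total number of seconds."""
    seconds = 0
    for part in time_string.strip().split(":"):
        # Each colon-separated part is worth 60 times the part after it.
        seconds = seconds * 60 + int(part)
    return seconds


print(time_to_seconds("41:33"))    # 2493
print(time_to_seconds("1:02:03"))  # 3723
```

With the data frame built above, `super_metroid_times["time"].apply(time_to_seconds)` would produce a numeric column.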

<details><summary>
Q: **Large-ish Exercise**: Scrape the leaderboards for [Ocarina of Time](http://zeldaspeedruns.com/leaderboards/oot/any) into a dataframe.
</summary>

```python
zelda_url = 'http://zeldaspeedruns.com/leaderboards/oot/any'
rz = requests.get(zelda_url)

soup = BeautifulSoup(rz.content, "lxml")

# All of these below are equivalent for searching
tab = soup.find("table", class_='runs table')
#tab = soup.find("table")
#tab = soup.find("table", {"class":'runs table'})

# This returns a list of the rows in the table.
rows = tab.find_all("tr")

all_rows = []

# Let's store each row as a dictionary 
empty_row = {
    "rank": None, 
    "player": None, 
    "time": None, 
    "version": None,
    "date": None,
    "status": None,
}

# The first row contains header information, so we are skipping it.
for row in rows[1:]:
    new_row = copy.copy(empty_row)
    # A list of all the entries in the row.
    columns = row.find_all("td")
    new_row['rank'] = int(columns[0].text.strip())
    new_row['player'] = columns[1].text.strip()
    new_row['time'] = columns[2].text.strip()
    new_row['version'] = columns[3].text.strip()
    new_row['date'] = columns[4].text.strip()
    # Sixth column, stored under the "status" key declared in empty_row
    # (assumed to be the intended name for this column).
    new_row['status'] = columns[5].text.strip()
    all_rows.append(new_row)
```
</details>

## Example: Use a web API to scrape Brewery Location

I used this API to programmatically collect data for a beer recommendation engine.  This service is *designed* for programmers to interact with.

[Brewery Map API Documentation](https://beermapping.com/api/)

[Yelp API](https://www.yelp.com/developers/documentation/v2/search_api)

A high level summary of the documentation:

#### Step 1: Get the Data

In [39]:
import json
import re
import requests
from bs4 import BeautifulSoup
from collections import defaultdict

Notice a few things about the functions below:

* We need to pass a key to get info back.
* We specify the criteria of the query in the URL.

In [None]:
#Do NOT do this, and make sure not to push this to GitHub
API_KEY = 'a670f41f909ce02de585b95a55fb8e9a'

# Example request URL (with the key filled in):
# http://beermapping.com/webservice/loccity/a670f41f909ce02de585b95a55fb8e9a/phoenix,az

def breweries_near_me(citystate):
    """Query the beermapping.com API for breweries near a given city and state.
    
        Input(str): A string of the city and state in the format phoenix,az
        
        Output(list): a list of dictionaries, one per brewery, each holding
            the brewery's name, phone, zip, and rating
    
    """
    #Create a list to collect the results from the API
    brewery_ratings = []
    
    #Format input for API consumption
    form_city_state = citystate.lower().replace(' ','')
    
    #Make a call to the beer mapping API for the user specified city, st.
    breweries_content = requests.get('http://beermapping.com/webservice/loccity/{}/{}'.format(API_KEY, 
                                                                                              form_city_state))

    #Make soup specific object from request API
    content = BeautifulSoup(breweries_content.content, 'html.parser')
    
    #Iterate through all results from the API request
    for brew_name, brew_id, status, zip_, phone in zip(content.findAll('name'), \
                                                       content.findAll('id'), \
                                                       content.findAll('status'), \
                                                       content.findAll('zip'), \
                                                       content.findAll('phone')):

        #Skip beer store results to only return brewery information
        if status.string != 'Beer Store':
            temp_dict = dict()
            temp_dict['name'] = brew_name.string
            temp_dict['phone'] = phone.string
            temp_dict['zip'] = zip_.string
            temp_dict['rating'] = brew_rating(brew_id.string)
            brewery_ratings.append(temp_dict)
        else:
            pass
        
    return brewery_ratings


def brew_rating(brewery_id):
    """Helper function to breweries_near_me
    
    
        Input(str): Brewery ID
        
        Output(dict): various ratings associated with the user defined brewery
    """
    rating_dict = {}
    
    brew_rating = requests.get('http://beermapping.com/webservice/locscore/{}/{}'.format(API_KEY,
                                                                                         brewery_id))
    content = BeautifulSoup(brew_rating.content, 'html.parser')
    rating_dict['overall'] = content.find('overall').string
    rating_dict['selection'] = content.find('selection').string
    rating_dict['service'] = content.find('service').string
    rating_dict['atmosphere'] = content.find('atmosphere').string
    rating_dict['food'] = content.find('food').string
    rating_dict['reviewcount'] = content.find('reviewcount').string
    
    return rating_dict

In [None]:
brewery_list = breweries_near_me('phoenix,az')
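Since the key above should never be committed, one common alternative is to read it from an environment variable.  This is a minimal sketch; the variable name `BEERMAPPING_API_KEY` and the helper `load_api_key` are assumptions made for this example, not something the API requires.

```python
import os


def load_api_key(env_var="BEERMAPPING_API_KEY"):
    """Read the API key from an environment variable instead of the source code."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError("Set the {} environment variable first".format(env_var))
    return key

# Usage, after running `export BEERMAPPING_API_KEY=...` in the shell:
#   API_KEY = load_api_key()
```

This keeps the secret out of the notebook and out of version control.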

#### Step 2: Store the Data in MongoDB

In [None]:
# import MongoDB modules
from pymongo import MongoClient
#from bson.objectid import ObjectId

# connect to the hosted MongoDB instance
client = MongoClient('mongodb://localhost:27017/')
db = client.beer_db

In [None]:
collection = db.breweries

In [None]:
for brewery in brewery_list:

    # find_one needs a filter document; a bare string would be treated as an _id.
    if not collection.find_one({'name': brewery['name']}):
        collection.insert_one(brewery)
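The check-then-insert pattern can also be written as an upsert, the same idea shown earlier in the Mongo shell.  The sketch below assumes `collection` behaves like a PyMongo collection (it relies on `update_one` with `upsert=True` and on the result's `upserted_id`); `store_breweries` is a hypothetical helper name.

```python
def store_breweries(collection, breweries):
    """Insert or refresh one document per brewery name; return how many were newly inserted."""
    inserted = 0
    for brewery in breweries:
        result = collection.update_one(
            {"name": brewery["name"]},   # match on the brewery name
            {"$set": brewery},           # overwrite the stored fields
            upsert=True,                 # insert when nothing matches
        )
        # upserted_id is only set when a new document was created.
        if result.upserted_id is not None:
            inserted += 1
    return inserted

# Usage with the objects defined above:
#   store_breweries(collection, brewery_list)
```

Unlike the loop above, this also refreshes the stored fields when a brewery is already present.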

In [None]:
for pub in collection.find():
    print(pub)

In [None]:
import requests

client_id = 'i7qXEh7xelvV5FXt_n5wnA'
client_secret = 'WsjCqbcuBGtCL0cLBZWd2p5eOaXZu6B8d5yVPqY5PMxX4kp4GWkevYHG06o30XJK'

token = requests.post("https://api.yelp.com/oauth2/token", data={'client_id': client_id, 
                                                                 'client_secret': client_secret})

access_token = token.json()['access_token']

url = 'https://api.yelp.com/v3/businesses/search'
headers = {'Authorization': 'bearer {}'.format(access_token)}
params = {'location': 'sf', 
          'category_filter':'gastropubs'
         }

resp = requests.get(url=url, params=params, headers=headers)

import pprint
pprint.pprint(resp.json()['businesses'])

In [42]:
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver

url = 'http://instagram.com/umnpics/'
driver = webdriver.Chrome()
driver.get(url)

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(driver.page_source, "lxml")

for x in soup.findAll('li', {'class':'photo'}):
    print(x)

WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home


In [45]:
r2 = requests.get('https://www.sportscapping.com/')
siz = BeautifulSoup(r2.content, "lxml")

In [49]:
req = requests.get('https://www.sportscapping.com/images/sportscapping/services/5759_service_photo.png')

In [51]:
# Image data is bytes, so the file must be opened in binary mode.
with open('test.png','wb') as pic:
    pic.write(req.content)

In [52]:
url = 'http://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=board+games&_sacat=0'

In [96]:
soup = BeautifulSoup(requests.get(url).content, "lxml")

In [58]:
!mkdir pics

In [98]:
# Search by class (not tag name), and write into the pics directory in binary mode.
for i in soup.find_all(class_='img imgWr2'):
    pic_url = i.find('img')['src']
    print(pic_url)
    temp = requests.get(pic_url)
    print(temp)
    with open("pics/{}".format(pic_url.split('/')[-1]), 'wb') as pic:
        pic.write(temp.content)
    

In [114]:
for i in soup.find_all(class_='img imgWr2'):
    pic_url = i.find('img')['src']
    temp = requests.get(pic_url)
    print(pic_url.split('/')[-1])
    with open("pics/{}".format(pic_url.split('/')[-1]), 'wb') as pic:
        pic.write(temp.content)

s-l225.jpg
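The download loop can be factored into a small helper that always writes in binary mode and derives the file name from the URL path.  This is a sketch using only the standard library; `save_binary` is a hypothetical name, and the caller is assumed to pass the bytes from something like `requests.get(url).content`.

```python
import os
from urllib.parse import urlparse


def save_binary(url, content, directory):
    """Write `content` (bytes) into `directory`, named after the last URL path segment."""
    # urlparse drops any query string, which split('/') alone would keep.
    filename = os.path.basename(urlparse(url).path)
    path = os.path.join(directory, filename)
    # 'wb' opens the file in binary mode, which image bytes require.
    with open(path, "wb") as handle:
        handle.write(content)
    return path

# Usage inside the loop above:
#   save_binary(pic_url, temp.content, "pics")
```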


In [113]:
!ls

README.md                  pics
basic.html                 unicorns.json
geckodriver.log            web_scraping_lecture.ipynb
images
