# Web Scraping and HTML Concepts


### Morning Objectives:
* *Describe* a typical web scraping data pipeline
* *Compare and Contrast* SQL and noSQL
* *Perform* basic operations using Mongo

### Afternoon Objectives:
* *Explain* the basic concepts of HTML
* *Learn how to* write code to pull elements from a web page
* *Use* an existing API to fetch data and parse using BeautifulSoup

## Resources

* [Precourse-Web Awareness](https://github.com/zipfian/precourse/tree/master/Chapter_8_Web_Awareness)
* [The Little MongoDB Book](http://openmymind.net/mongodb.pdf)
* [w3 schools](http://www.w3schools.com/) : HTML tags and thier attributes.
* [PyMongo tutorial](http://api.mongodb.org/python/current/tutorial.html)
* [BeautifulSoup Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Scrape anonymously with Tor](https://deshmukhsuraj.wordpress.com/2015/03/08/anonymous-web-scraping-using-python-and-tor/)

## Installing Mongo and PyMongo

### Mongo
1. Install MongoDB: `brew install mongodb`
2. Start MongoDB: `brew services start mongodb`

#### Do *not* run services as `root`.  Ever.  Even if someone tells you to.

### PyMongo
2. Install PyMongo: `conda install pymongo`

### SQL vs NoSQL

NoSQL does not stand for 'No SQL'. SQL is useful for many things, it's not going away.

> NoSQL ==> "Not Only SQL"

It's a different Paradigm to deal with messy data that does not lend itself to an RDBMS.  It's also very useful as a quick and ppainless solution to data storage, where a full relational database model takes much thought and investment.

### Mongo Clients

The command line program we use to interact with mongo is a *client*.  It's only job is to send messages to another program, a *server*, which holds all our data and knows how to operate on it.

The command line Mongo client is written in javascript, so interacting with mongo with this client looks like writing javascript code.

<img src="images/client-server.png" width = 500>

There are other clients.  Late on we will use `pymongo` to interact with our databases from python.

## Javascript Object Notation

Javascript Object Notation, or JSON, is a simple data storage and communication protocall.  It was designed by [Douglas Crockford](https://en.wikipedia.org/wiki/Douglas_Crockford) based on the notation Javascipt uses for objects.

It is meant as a replacement to XML.

```javascript
{
    name: 'TwilightSparkle',
    friends: ['Applejack', 'Fluttershy'],
    age: 16,
    gender: 'f',
    wings: true,
    horn: true,
    residence: {
        town: 'Ponyville',
        address: '15 Gandolfini Lane'}
}
```

It is very similar to a python dictionary literal, but you *cannot use double quotes to enclose strings*.

## Working with Mongo DB

### MongoDB Concepts

#### What's it about? 

* MongoDB is a document-oriented database, an alternative to RDBMS, used for storing semi-structured data.
* JSON-like objects form the data model, rather than RDBMS tables.
* No schema, No joins, No transactions.
* Sub-optimal for complicated queries.

#### Structure of the database.

* MongoDB is made up of databases which contain collections (tables).
* A collection is made up of documents (analogous to rows or records).
* Each document is a JSON object made up of key-value pairs (analogous to columns).


So a RDBMS defines columns at the table level, document oriented database defines its fields at a document level.

I created a `unicorns.json` file that can be imported into MongoDB.

```
mongoimport --db unicorns --collection unicorns < unicorns.json
```

Now start mongo. 

A MongoDB contains a collection of databases, so lets check that the `unicorns` database exists.

```
> show dbs
```

To use the `unicorns` database, we simply do the following:

```
> use unicorns
```

A database is made of `collection`s, which are containers for the actual stored data.  A `collection` would be analagous to a `table` in a classical relational database, but can contain much more flexible data than a table.

```
> db.getCollectionNames()
```

### Inserting Data
```javascript
db.unicorns.insert({
    name: 'Applejack',
    age: 15,
    friends: ['TwilightSparkle', 'Fluttershy'],
    wings: false,
    horn: false
})

db.unicorns.insert({
    name: 'Fluttershy',
    age: 15,
    friends: ['Applejack', 'TwilightSparkle'],
    wings: true,
    horn: false
})
                 
```

## Querying Data

Without any arguments, `find` dumps all the data in the collection

```javascript
db.unicorns.find()
```

`find` is much more flexible.

```javascript
// find by single field
db.unicorns.find({name: 'TwilightSparkle'})

// find by presence of field
db.unicorns.find({friends: {$exists : true}})

// find by value in array
db.unicorns.find({friends: 'TwilightSparkle'})

// To return only certain fields
// This says, return only the names of unicorns who are friends with
// twilight sparkle.
db.unicorns.find({friends: 'TwilightSparkle'}, {name: true})
```

**Excercise**: Try to find all the unicorns with wings.  Then find only the friends of unicorns with wings.

## Updating Data

```javascript
// Replaces friends array
db.unicorns.update({
    name: 'TwilightSparkle'}, {
    $set: {
        friends: ['Shutterfly', 'Rarity', 'Applejack']}})

// Adds to friends array
db.users.update({
    name: 'Applejack'}, {
    $push: {
        friends: "Rarity"}})
```

We have to use the `$set` and `$push` operators, the default behaviour of `update` is to *replace the data*.

```javascript
// Replaces the TwighlightSparkle data completely!
// It will no longer have even a name field after this!
db.unicorns.update({
    name: 'TwilightSparkle'}, {
    friends: ['Shutterfly', 'Rarity', 'Applejack']}
```


An `upsert` either creates a document (when it does not already exist) or inserts into an existing document.

```
// Upsert: This one is created
db.unicorns.update({
    name: "Rarity"}, {
    $push: {
        friends: {
            $each: ["TwilightSparkle", "Applejack", "Fluttershy"]}}}, {
    upsert: true})

// Upsert: This one is updated
db.unicorns.update({
    name: "Fluttershy"}, {
    $push: {
        friends: {
            $each: ["Rarity", "PrincessCelestia"]}}}, {
    upsert: true})
```

**Excercise**: Enter a unicorn named RainbowDash into the database that is friends with TwilightSparkle, Rarity, and Applejack.

## Deleting Data

*Don't run this one!*

```javascript
db.unicorns.remove()
```

## PyMongo


`pymongo` allows python to connect to and manipulate MongoDB.

In [1]:
from pymongo import MongoClient
import pprint

In [2]:
# Connect to the hosted MongoDB instance
client = MongoClient('mongodb://localhost:27017/')

In [3]:
db = client.unicorns

In [4]:
# Create a collection called users
unicorns = db.unicorns

In [5]:
unicorns.insert_one({
    'name': 'RainbowDash', 
    'age': 16, 
    'friends': ['TwilightSparkle', 'Applejack', 'Rarity']})

<pymongo.results.InsertOneResult at 0x104aa27d0>

In [6]:
unicorns.find().count()

6

In [8]:
pprint.pprint(unicorns.find_one())

{u'_id': ObjectId('58994634aa7f50eb31ed3298'),
 u'age': 16,
 u'friends': [u'Shutterfly', u'Rarity', u'Applejack'],
 u'gender': u'f',
 u'horn': True,
 u'name': u'TwilightSparkle',
 u'residence': {u'address': u'15 Gandolfini Lane', u'town': u'Ponyville'},
 u'wings': True}


In [8]:
rarity = unicorns.find_one({'name': 'Rarity'})
pprint.pprint(rarity)

{u'_id': ObjectId('5897b920aa7f50eb31ed1dd2'),
 u'friends': [u'AppleJack', u'TwilightSparkle', u'Applejack', u'Fluttershy'],
 u'name': u'Rarity'}


The same selector strategies can be used for more complex queries in `pymongo`

In [9]:
friend_of_twilight = unicorns.find_one({'friends': 'TwilightSparkle'})
pprint.pprint(friend_of_twilight)

{u'_id': ObjectId('5897b54dfbdd78fae6434063'),
 u'age': u'15',
 u'friends': [u'TwilightSparkle', u'Fluttershy'],
 u'name': u'Applejack',
 u'wings': False}


To get multiple results back, use `find`, which returns an iterator.

In [10]:
friends_of_twilight = unicorns.find({'friends': 'TwilightSparkle'})
for friend in friends_of_twilight:
    pprint.pprint(friend)

{u'_id': ObjectId('5897b54dfbdd78fae6434063'),
 u'age': u'15',
 u'friends': [u'TwilightSparkle', u'Fluttershy'],
 u'name': u'Applejack',
 u'wings': False}
{u'_id': ObjectId('5897b54ffbdd78fae6434064'),
 u'age': u'15',
 u'friends': [u'Applejack',
              u'TwilightSparkle',
              u'Rarity',
              u'PrincessCelestia'],
 u'name': u'Fluttershy',
 u'wings': False}
{u'_id': ObjectId('5897b920aa7f50eb31ed1dd2'),
 u'friends': [u'AppleJack', u'TwilightSparkle', u'Applejack', u'Fluttershy'],
 u'name': u'Rarity'}
{u'_id': ObjectId('5897bc1f91de8e89aa35bba6'),
 u'age': 16,
 u'friends': [u'TwilightSparkle', u'Applejack', u'Rarity'],
 u'name': u'RainbowDash'}
{u'_id': ObjectId('5897fe42aa7f50eb31ed29ee'),
 u'age': 34,
 u'friends': [u'TwilightSparkle'],
 u'horn': True,
 u'name': u'PrincessCelestia',
 u'wings': True}
{u'_id': ObjectId('5898c7b691de8e8da9715160'),
 u'age': 16,
 u'friends': [u'TwilightSparkle', u'Applejack', u'Rarity'],
 u'name': u'RainbowDash'}


In [9]:
young_unicorns = unicorns.find({'age': {'$lt': 16}})
for unicorn in young_unicorns:
    pprint.pprint(unicorn)

{u'_id': ObjectId('589946bf3c68dd9ac76e2ff1'),
 u'age': 15.0,
 u'friends': [u'TwilightSparkle', u'Fluttershy'],
 u'horn': False,
 u'name': u'Applejack',
 u'wings': False}
{u'_id': ObjectId('589946f03c68dd9ac76e2ff2'),
 u'age': 15.0,
 u'friends': [u'Applejack', u'TwilightSparkle'],
 u'horn': False,
 u'name': u'Fluttershy',
 u'wings': True}


**Exercise:** Find all the unicorns that have a horn and wings.

# Afternoon: Web Scraping using requests and BeautifulSoup

## HTML Concepts

**H**yper**T**ext **M**arkup **L**anguage

A *markup language* (think markdown) that forms the building blocks of all websites.  Controls what to say and where to say it, along with some semantic meaning (this is a section, this is a list, this part is emphasised).

Consists of tags enclosed in angle brackets (like `<html>`)

A minimal HTML document, unfortuantely, contains a lot of cruft.  Here's one I got from [https://www.sitepoint.com/a-minimal-html-document/](https://www.sitepoint.com/a-minimal-html-document/).


```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
  <head>
  
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title>title</title>
    <link rel="stylesheet" type="text/css" href="style.css">
    <script type="text/javascript" src="script.js"></script>
  </head>
  <body>
		
  </body>
</html>
```

The `<link>` and `<script>` tags are not strictly necessary, but will appear in more or less every HTML document.

* The `<link>` tag points to a **stylesheet**, which controls who different parts of the docuemnt are rendered in the browser.  This makes things pretty.
* The `<script>` tag points to a **javascript** program.  This allows programmers to add *dynamic behaviour* to a html document.
* The `<body>` tag contains the guts of your document.

### Important Tags

```html
<div>Defines a division or section of the docuemnt.</div>
<a href="http://www.w3schools.com">A Gyperlink to W3Schools.com!</a>

<h1>This is a header!</h1>

<p>This is a paragraph!</p>

<h2>This is a Subheading!</h2>

<table>
  This is a table!
  <tr>
    <td>An entry in the first row.</td>
    
    <td>Another entry in the first row.</td>
  </tr>
  <tr>
    <td>An entry in the second row.</td>
    <td>Another entry in the second row.</td>
  </tr>
</table>

<ul>
  This is a list!
  <li>This is the first thing in the list!</li>
  <li>This is the second thing in the list!</li>
</ul>
```

I saved the HTML document described above as `basic.html`.

## Web vs Internet

The web or www or (World Wide Web), is different from Internet.  You can think of the web as collection of islands and internet as bridges connecting the islands.

I.e. the web is a collection of *content* and the internet is the *infastructure for accessing and distributing* this content.

HTTP is the language of the Web.  Data on the web is distributed as HTTP documents.

## HTTP Requests

To get data from the web, you need to make a HTTP request.  The two most important request types are:

* GET (queries data, no data is *sent*)
* POST (updates data, *data must be sent*)

`curl` is a command line program for sending HTTP requests.  It's easy to send a `GET` request to a url.

In [10]:
!curl http://madrury.github.io

<!DOCTYPE html>
<html>

   <script src="https://d3js.org/d3.v3.min.js" charset="utf-8"></script>
 <script src="/js/fourier-polynomial.js"></script>
 <script src="/js/oscilloscope.js"></script>
<script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>

  <head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <title>Scatterplot Smoothers</title>
  <meta name="description" content="Musings and ramblings on statistics, math, and computer programming.
">

  <link rel="stylesheet" href="/css/main.css">
  <link rel="canonical" href="http://yourdomain.com/">
  <link rel="alternate" type="application/rss+xml" title="Scatterplot Smoothers" href="http://yourdomain.com/feed.xml">

</head>


  <body>

    <header class="site-header">

  <div class="wrapper">

    <a class="site-title" href="/">Scatterplot Smoothers</a>

`curl` can also send POST requests, but with a bit more effort.

In [11]:
!curl -X POST -H "Content-Type: application/json" -H 'User-Agent: DataWrangling/1.1 matthew.drury@galvanize.com' -d '{"action": "parse", "format": "json", "page": "Unicorn"}' https://en.wikipedia.org/w/api.php  

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>MediaWiki API help - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"Special","wgCanonicalSpecialPageName":"ApiHelp","wgNamespaceNumber":-1,"wgPageName":"Special:ApiHelp","wgTitle":"ApiHelp","wgCurRevisionId":0,"wgRevisionId":0,"wgArticleId":0,"wgIsArticle":false,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgBreakFrames":true,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","J

We're going to send this POST request in a much better way below, so don't worry about remembering how to do it with curl.

## Scraping

Web Scraping is the process of programatically getting data from the web.

<img src="images/pipeline.png" width = 500>

### Example: Load table into a data frame.

Lets load the Super Metroid speedrun leaderboards at [Deer Tier](http://deertier.com/Leaderboard/AnyPercentRealTime) into a Mongo database, and then load this database into a pandas data frame.

In [12]:
import warnings
warnings.filterwarnings('ignore')

In [13]:
import copy
import pandas as pd

# Requests sends and recieves HTTP requests.
import requests

# Beautiful Soup parses HTML documents in python.
from bs4 import BeautifulSoup

#### Step 1: Check out the website in a browser.

The first step is to check out the website in a browser.

Open the `Developer Tools` to get a useful display of the hypertext we will be working with.

The table we will need is inside a `<div>` with `class=scoreTable`.  Looking closely the structure is like this:

```
<div class=scoreTable>
  <table>
    <tr>..</tr>
    ...
    <tr>...</tr>
  </table>
</div>
```

Each row has a `title` attribute that contains some interesting data:

```
<tr title="Submitted by Oatsngoats on: 19/10/2016">
```

Inside each row, the columns have the following data:

```
rank, player, time, video url, comment
```

This should be enough infomation for us to get to scraping.

#### Step 2: Send a GET request for the data.

In [14]:
deer_tier_url = 'http://deertier.com/Leaderboard/AnyPercentRealTime'
r = requests.get(deer_tier_url)

A status code of `200` means that everything went well.

In [15]:
r.status_code

200

We can check out the raw hypertext in the `content` attribute of the request.

In [16]:
r.content

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<head runat="server">\r\n    \r\n    <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png" />\r\n    <link rel="icon" type="image/png" href="/favicon-32x32.png" sizes="32x32" />\r\n    <link rel="icon" type="image/png" href="/favicon-16x16.png" sizes="16x16" />\r\n    <link rel="manifest" href="/manifest.json" />\r\n    <link rel="mask-icon" href="/safari-pinned-tab.svg" color="#fec200" />\r\n    <meta name="apple-mobile-web-app-title" content="Deer Tier" />\r\n    <meta name="application-name" content="Deer Tier" />\r\n    <meta name="theme-color" content="#331f52" />\r\n    <meta name="viewport" content="width=device-width" />\r\n\r\n    <link href="/CSS/Reset.css" rel="Stylesheet" type="text/css" />\r\n    <link href="/CSS/font-awesome.min.css" rel="Stylesheet" type="text/css"  />\r\n    <

#### Step 3: Save all the hypertext into mongo for later use.

In [29]:
db = client.metroid
pages = db.pages

pages.insert_one({'html': r.content})

<pymongo.results.InsertOneResult at 0x115b446e0>

#### Step 4: Parse the hypertext with BeautifulSoup.

This is the beautiful part of the soup.  Parsing the HTML into a python object is effortless.

In [20]:
soup = BeautifulSoup(r.content, "lxml")

In [21]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head runat="server">
  <link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
  <link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
  <link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
  <link href="/manifest.json" rel="manifest"/>
  <link color="#fec200" href="/safari-pinned-tab.svg" rel="mask-icon"/>
  <meta content="Deer Tier" name="apple-mobile-web-app-title"/>
  <meta content="Deer Tier" name="application-name"/>
  <meta content="#331f52" name="theme-color"/>
  <meta content="width=device-width" name="viewport"/>
  <link href="/CSS/Reset.css" rel="Stylesheet" type="text/css"/>
  <link href="/CSS/font-awesome.min.css" rel="Stylesheet" type="text/css"/>
  <link href="/CSS/Styles.css?v=21" rel="Stylesheet" type="text/css"/>
  <script src="/Scripts/jq

In [19]:
print soup.title

<title>Any% Real Time - Deer Tier</title>


#### Step 5: Navigate the data to pull out the table information.

Recall the structure of the table we are looking for:

```
<div class=scoreTable>
  <table>
    <tr>..</tr>
    ...
    <tr>...</tr>
  </table>
</div>
```

In [22]:
div = soup.find("div", {"class": "scoreTable"})
table = div.find("table")

# This returns an iterator over the rows in the table.
rows = table.find_all("tr")

all_rows = []

# Let's store each row as a dictionary 
empty_row = {
    "rank": None, "player": None, "time": None, "comment": None
}

# The first row contains header information, so we are skipping it.
for row in rows[1:]:
    new_row = copy.copy(empty_row)
    # A list of all the entries in the row.
    columns = row.find_all("td")
    new_row['rank'] = int(columns[0].text.strip())
    new_row['player'] = columns[1].text.strip()
    new_row['time'] = columns[2].text.strip()
    new_row['comment'] = columns[4].text.strip()
    all_rows.append(new_row)    

In [23]:
pprint.pprint(all_rows[:4])

[{'comment': u'Mole%',
  'player': u'Oatsngoats',
  'rank': 1,
  'raw_row': None,
  'time': u'41:56'},
 {'comment': u'',
  'player': u'zoast',
  'rank': 2,
  'raw_row': None,
  'time': u'41:58'},
 {'comment': u'Goofy quarter halfie.',
  'player': u'kottpower',
  'rank': 3,
  'raw_row': None,
  'time': u'42:20'},
 {'comment': u'choke artist',
  'player': u'Twocat',
  'rank': 4,
  'raw_row': None,
  'time': u'42:29'}]


#### Step 6: Load all the rows into a Mongo database.

Since we collected all the rows into python dictionaries, this is easy.

In [24]:
db = client.metroid

In [25]:
deer_tier = db.deer_tier

In [26]:
for row in all_rows:
    deer_tier.insert_one(row)

Now we can check from the command line that the data is really in there!

#### Step 7: Load all the rows into a pandas dataframe.

Even though there is no real reason to, let's load all the rows from the Mongo database just to give a more thorough example of how you can go about things.

In [27]:
rows = deer_tier.find()
super_metroid_times = pd.DataFrame(list(rows))

In [28]:
super_metroid_times.head()

Unnamed: 0,_id,comment,player,rank,raw_row,time
0,58994d0991de8e913dd5592c,Mole%,Oatsngoats,1,,41:56
1,58994d0991de8e913dd5592d,,zoast,2,,41:58
2,58994d0991de8e913dd5592e,Goofy quarter halfie.,kottpower,3,,42:20
3,58994d0991de8e913dd5592f,choke artist,Twocat,4,,42:29
4,58994d0991de8e913dd55930,Route needs more door transitions :kappa:,Behemoth,5,,42:45


In [29]:
super_metroid_times = super_metroid_times.drop("_id", axis=1)
super_metroid_times = super_metroid_times.set_index("rank")
super_metroid_times.head()

Unnamed: 0_level_0,comment,player,raw_row,time
rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Mole%,Oatsngoats,,41:56
2,,zoast,,41:58
3,Goofy quarter halfie.,kottpower,,42:20
4,choke artist,Twocat,,42:29
5,Route needs more door transitions :kappa:,Behemoth,,42:45


Goal Achieved!

## Example: Use a web API to scrape Wikipedia

Wikipedia provides a free API to programatically collect data.  This service is *designed* for programmers to interact with.

[Wikipedia API Documentation](https://www.mediawiki.org/wiki/API:Main_page)

A high level summary of the documentation:

> Send a POST request to https://en.wikipedia.org/w/api.php with a JSON payload describing the data you want, and the format in which you want it.

#### Step 1: Get the Data

In [30]:
import json
import re

Wikipedia wants us to identify ourselves before it will give us data.  The `User-Agent` section of a HTTP header contains this information.

In [31]:
headers = {'User-Agent': 'GalvanizeDataWrangling/1.1 matthew.drury@galvanize.com'}

In [32]:
api_url = 'https://en.wikipedia.org/w/api.php'

# Parameters for the API request: We want the Unicorn page encoded as json.
payload = {'action': 'parse', 'format': 'json', 'page': "Unicorn"}

r = requests.post(api_url, data=payload, headers=headers)

In [33]:
print(r.json().keys())

[u'parse']


We get a lot of data back!

In [34]:
print(r.json()['parse'])

{u'templates': [{u'*': u'Template:Other uses', u'ns': 10, u'exists': u''}, {u'*': u'Template:Pp-move-indef', u'ns': 10, u'exists': u''}, {u'*': u'Template:Pp-semi-vandalism', u'ns': 10, u'exists': u''}, {u'*': u'Template:Pp-vandalism', u'ns': 10, u'exists': u''}, {u'*': u'Template:Infobox mythical creature', u'ns': 10, u'exists': u''}, {u'*': u'Template:Infobox', u'ns': 10, u'exists': u''}, {u'*': u'Template:Convert', u'ns': 10, u'exists': u''}, {u'*': u'Template:Lang-el', u'ns': 10, u'exists': u''}, {u'*': u'Template:Language with name and transliteration', u'ns': 10, u'exists': u''}, {u'*': u'Template:Lang', u'ns': 10, u'exists': u''}, {u'*': u'Template:Category handler', u'ns': 10, u'exists': u''}, {u'*': u'Template:ISO 639 name', u'ns': 10, u'exists': u''}, {u'*': u'Template:ISO 639 name el', u'ns': 10, u'exists': u''}, {u'*': u'Template:Citation needed', u'ns': 10, u'exists': u''}, {u'*': u'Template:Fix', u'ns': 10, u'exists': u''}, {u'*': u'Template:Fix/category', u'ns': 10, u'ex

#### Step 2: Store the Data in MongoDB

In [35]:
# import MongoDB modules
from pymongo import MongoClient
from bson.objectid import ObjectId

# connect to the hosted MongoDB instance
client = MongoClient('mongodb://localhost:27017/')
db = client.wikipedia

In [36]:
collection = db.wikipedia

In [37]:
if not collection.find_one(r.json()['parse']):
    collection.insert_one(r.json()['parse'])

In [38]:
unicorn_article = collection.find_one({ "title" : "Unicorn"})

In [39]:
pprint.pprint(unicorn_article)

{u'_id': ObjectId('5898c7bc91de8e8da971533a'),
 u'categories': [{u'*': u'Pages_using_ISBN_magic_links',
                  u'hidden': u'',
                  u'sortkey': u''},
                 {u'*': u'Wikipedia_indefinitely_move-protected_pages',
                  u'hidden': u'',
                  u'sortkey': u'Unicorn'},
                 {u'*': u'Wikipedia_pages_semi-protected_against_vandalism',
                  u'hidden': u'',
                  u'sortkey': u'Unicorn'},
                 {u'*': u'Articles_containing_Greek-language_text',
                  u'hidden': u'',
                  u'sortkey': u''},
                 {u'*': u'All_articles_with_unsourced_statements',
                  u'hidden': u'',
                  u'sortkey': u''},
                 {u'*': u'Articles_with_unsourced_statements_from_April_2010',
                  u'hidden': u'',
                  u'sortkey': u''},
                 {u'*': u'Articles_with_unsourced_statements_from_April_2015',
                  u'

In [40]:
print unicorn_article.keys()

[u'templates', u'iwlinks', u'pageid', u'links', u'langlinks', u'title', u'text', u'revid', u'externallinks', u'images', u'displaytitle', u'_id', u'sections', u'properties', u'categories']


#### Step 3: Retrieve and store every article (with associated metadata) within one link

We want to hop from the 'Unicorn' article. *Do not follow external links, only linked Wikipedia articles*

HINT: The Unicorn Law article should be located at: 
'http://en.wikipedia.org/w/api.php?action=parse&format=json&page=Unicorn'

In [41]:
links = unicorn_article['links']

pprint.pprint(links)

[{u'*': u'Wikipedia:Protection policy', u'exists': u'', u'ns': 4},
 {u'*': u'Wikipedia:Accuracy dispute', u'exists': u'', u'ns': 4},
 {u'*': u'Wikipedia:Citation needed', u'exists': u'', u'ns': 4},
 {u'*': u'Template:ISO 639 name el', u'exists': u'', u'ns': 10},
 {u'*': u'Template:ISO 639 name he', u'exists': u'', u'ns': 10},
 {u'*': u'Template:Heraldic creatures', u'exists': u'', u'ns': 10},
 {u'*': u'Category:Articles containing Greek-language text',
  u'exists': u'',
  u'ns': 14},
 {u'*': u'Category:Articles with unsourced statements from April 2010',
  u'exists': u'',
  u'ns': 14},
 {u'*': u'Category:Articles with unsourced statements from April 2015',
  u'exists': u'',
  u'ns': 14},
 {u'*': u'Category:Articles with disputed statements from September 2016',
  u'exists': u'',
  u'ns': 14},
 {u'*': u'Category:Articles with unsourced statements from September 2011',
  u'exists': u'',
  u'ns': 14},
 {u'*': u'Category:Articles containing Hebrew-language text',
  u'exists': u'',
  u'ns':

Now let's request each of these documents, and store the result in our collection.

In [42]:
for link in links:

    payload = {'action': 'parse' ,'format': 'json', 'page' : link['*'] }
    r = requests.post(api_url, data=payload, headers=headers)

    # check to first see if the document is already in our database, if not, store it.
    try:
        j = r.json()
        if not collection.find_one(j['parse']):
            print("Writing The Article: {}".format(j['title']))
            collection.insert_one(j['parse'])
    # This is awful programming.
    except:
        continue

#### Step 4: Find all articles that mention 'Horn' or 'Horned' (case insensitive)

* Use regular expressions in order to search the content of the articles for the terms Zipf or Zipfian. 
* We only want articles that mention these terms in the displayed text however, so we must first remove all the unnecessary HTML tags and only keep what is in between the relevant tags. 
* Beautiful Soup makes this almost trivial. Explore the documentation to find how to do this effortlessly: http://www.crummy.com/softwa re/BeautifulSoup/bs4/doc/

* Test out your Regular Expressions before you run them over every document you have in your database: http://pythex.org/. Here is some useful documentation on regular expressions in Python: http://docs .python.org/2/howto/regex.html

* Once you have identified the relevant articles, save them to a file for now, we do not need to persist them in the database.

In [43]:
# compile our regular expression since we will use it many times
regex = re.compile('Horn | Horned', re.IGNORECASE)

with open('wiki_articles.txt', 'w+b') as out:

    for doc in collection.find():
        
        # Extract the HTML from the document
        html = doc['text']['*']

        # Stringify the ID for serialization to our text file
        doc['_id'] = str(doc['_id'])

        # Create a Beautiful Soup object from the HTML
        soup = BeautifulSoup(html)

        # Extract all the relevant text of the web page: strips out tags and head/meta content
        text = soup.get_text()

        # Perform a regex search with the expression we compiled earlier
        match = regex.search(text)

        # if our search returned an object (it matched the regex), write the document to our output file
        if match:
            try:
                print("Writing Article: {}".format(doc['title']))
                json.dump(doc, out) 
                out.write('\n')
            except UnicodeEncodeError:
                pass

    out.close()

Writing Article: Unicorn
