# Web Scraping and HTML Concepts


## DSI Standards: Web Scraping

* Compare and contrast SQL and NoSQL
* Complete basic operations with MongoDB
* Explain the basic concepts of HTML
* Write Python code to pull out an element from a web page
* Fetch data from an existing API

### Morning:
* *Describe* a typical web scraping data pipeline
* *Compare and Contrast* SQL and NoSQL
* *Perform* basic operations using Mongo
* *Explain* the basic concepts of HTML

### Afternoon:
* *Learn how to* write code to pull elements from a web page
* *Use* an existing API to fetch data and parse using BeautifulSoup


## 1. Resources

* [Precourse-Web Awareness](https://github.com/zipfian/precourse/tree/master/Chapter_8_Web_Awareness)
* [The Little MongoDB Book](http://openmymind.net/mongodb.pdf)
* [w3 schools](http://www.w3schools.com/)
* [PyMongo tutorial](http://api.mongodb.org/python/current/tutorial.html)
* [BeautifulSoup Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Scrape anonymously with Tor](https://deshmukhsuraj.wordpress.com/2015/03/08/anonymous-web-scraping-using-python-and-tor/)



## 2. Installing Mongo and PyMongo

### Mongo
1. Install MongoDB: `brew install mongodb`
2. Start MongoDB: `brew services start mongodb`

#### Do *not* run services as `root`.

### PyMongo
1. Install PyMongo: `conda install pymongo`

## 3. Typical Pipeline
<img src="images/pipeline.png" width = 500>

## 4. SQL vs NoSQL

* Contrary to what some folks may want, NoSQL does not stand for 'No SQL'; it stands for "Not Only SQL".
* It is a different paradigm for dealing with messy data that does not lend itself to an RDBMS
* A NoSQL stack may still include an RDBMS component, alongside tools such as Redis to handle queuing and Hadoop for Big Data processing


## 5. MongoDB Concepts

* MongoDB is a document-oriented database, an alternative to RDBMS
* Used for storing semi-structured data
* JSON-like objects form the data model, rather than RDBMS tables
* No schema, No joins, No transactions
* Sub-optimal for complicated queries

* MongoDB is made up of databases, which contain collections (analogous to tables)
* A collection is made up of documents (analogous to rows or records)
* Each document is made up of key-value pairs, or fields (analogous to columns)

* An RDBMS defines its columns at the table level; a document-oriented database defines its fields at the document level.

* CURSOR: When you ask MongoDB for data, it returns a pointer to the result set, called a cursor.

* Actual execution of the query is delayed until the cursor is first iterated.
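The contrast between table-level columns and document-level fields can be sketched without a MongoDB server. The example below is only an illustration: it uses the standard library's `sqlite3` as a stand-in RDBMS and plain Python dicts as stand-ins for documents (the table and field names are borrowed from the `users` examples in these notes).

```python
import json
import sqlite3

# Relational side: the columns are declared once, at the table level.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("Jon", 45))

# A row cannot carry a column that the table does not define.
try:
    conn.execute(
        "INSERT INTO users (name, age, car) VALUES (?, ?, ?)",
        ("Frank", 17, "Civic"),
    )
    extra_column_accepted = True
except sqlite3.OperationalError:
    extra_column_accepted = False

# Document side: each document brings its own set of fields.
documents = [
    {"name": "Jon", "age": 45, "friends": ["Henry", "Ashley"]},
    {"name": "Frank", "age": 17, "friends": ["Billy"], "car": "Civic"},
]
fields_per_document = [sorted(doc) for doc in documents]

print(extra_column_accepted)
print(json.dumps(fields_per_document))
```

The second "document" simply has a `car` field that the first one lacks, which is the flexibility a document store offers in exchange for table-level guarantees.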



### Mongo Clients
<img src="images/client-server.png" width = 500>


## 6. Create a Database and do some operations

* Mongo can create databases, collections, documents, etc. on the fly.
* To create a new database, simply `use` a database you haven't created yet: `use my_new_database`

## Inserting Data
```
db.users.insert({name: 'Jon',
                 age: '45',
                 friends: ['Henry', 'Ashley']
                 })

show dbs
db.getCollectionNames()

db.users.insert({name: 'Ashley',
                 age: '37',
                 friends: ['Jon', 'Henry']
                 })
                 
db.users.insert({name: 'Frank',
                 age: '17',
                 friends: ['Billy'],
                 car: 'Civic'})

db.users.find()
```
* MongoDB adds an `_id` field to each document by default

## Querying Data
```
// find by single field
db.users.find({ name: 'Jon'})

// find by presence of field
db.users.find({ car: { $exists : true } })

// find by value in array
db.users.find({ friends: 'Henry' })

// field selection (only return name)
db.users.find({}, { name: true })
```
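The same four queries can be written with PyMongo. This is a minimal sketch, assuming `users` is a PyMongo collection object (for example `MongoClient(...).my_new_database.users`); the helper name `find_examples` is made up for illustration. The main difference from the shell is that keys and operators such as `$exists` must be quoted Python strings.

```python
def find_examples(users):
    """Run the four shell queries above against a PyMongo collection."""
    # find by single field
    by_name = users.find({'name': 'Jon'})

    # find by presence of field
    has_car = users.find({'car': {'$exists': True}})

    # find by value in array
    knows_henry = users.find({'friends': 'Henry'})

    # field selection: the second argument is the projection
    names_only = users.find({}, {'name': True})

    return by_name, has_car, knows_henry, names_only


# Usage against a local server (not run here):
# from pymongo import MongoClient
# users = MongoClient('mongodb://localhost:27017/').my_new_database.users
# for doc in find_examples(users)[0]:
#     print(doc)
```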


## Updating Data

```
// replaces friends array
db.users.update({name: "Jon"}, { $set: {friends: ["Phil"]}})

// adds to friends array
db.users.update({name: "Jon"}, { $push: {friends: "Susie"}})

// upsert
db.users.update({name: "Stevie"}, { $push: {friends: "Nicks"}}, true)

// multiple updates
db.users.update({}, { $set: { activated : false } }, false, true)
```
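In PyMongo, the positional `upsert` and `multi` flags of the shell's `update` become the keyword argument `upsert=True` and the separate method `update_many`. The sketch below mirrors the four shell updates, assuming `users` is a PyMongo collection; the function name `update_examples` is hypothetical.

```python
def update_examples(users):
    """Mirror the four shell updates above with PyMongo calls."""
    # replaces friends array
    users.update_one({'name': 'Jon'}, {'$set': {'friends': ['Phil']}})

    # adds to friends array
    users.update_one({'name': 'Jon'}, {'$push': {'friends': 'Susie'}})

    # upsert: insert the document if no match is found
    users.update_one({'name': 'Stevie'}, {'$push': {'friends': 'Nicks'}}, upsert=True)

    # multiple updates: apply the change to every matching document
    users.update_many({}, {'$set': {'activated': False}})
```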

## Deleting Data
```
// remove every document in the collection
db.users.remove({})
```

# MongoDB Example

## PyMongo


In [1]:
# import MongoDB modules
from pymongo import MongoClient

In [2]:
# connect to the local MongoDB instance
client = MongoClient('mongodb://localhost:27017/')

In [3]:
db = client.lb_tst1

In [4]:
# Get a collection (created lazily on first insert) and bind it to the name users
users = db.lb_tst1

In [5]:
users.insert_one({'name':'lekha', 'city':'seattle'})

<pymongo.results.InsertOneResult at 0x10647f410>

In [6]:
users.insert_one({'name':'joe', 'city':'new york' })

<pymongo.results.InsertOneResult at 0x10647f5a0>

In [7]:
users.count_documents({})

4

In [8]:
users.find_one()

{u'_id': ObjectId('56e723d2feb488233188c7bc'),
 u'city': u'seattle',
 u'name': u'lekha'}

In [9]:
t = users.find_one({'name': 'lekha'})
t

{u'_id': ObjectId('56e723d2feb488233188c7bc'),
 u'city': u'seattle',
 u'name': u'lekha'}

In [10]:
users.count_documents({})

4

## 7. HTML Concepts

* HyperText Markup Language
* A markup language that forms the building blocks of all websites
* Consists of tags enclosed in angle brackets (like `<html>`)

### Important Tags

```html
<div>Defines a division or section</div>
<a href="http://www.w3schools.com">Link to W3Schools.com!</a>
<table>Will contain a table</table>
<p>This is a paragraph</p>
<h1>This is a header!</h1>
<ul>
    <li>This is a list</li>
</ul>
```
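To see how a program "reads" these tags, here is a small sketch using only the standard library's `html.parser` (Beautiful Soup, used later in these notes, does the same job with far less code). The class name `TagCollector` and the HTML snippet are invented for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening tag and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            self.links.extend(value for name, value in attrs if name == 'href')


snippet = """
<div>
  <h1>This is a header!</h1>
  <p>This is a paragraph with a <a href="http://www.w3schools.com">link</a>.</p>
  <ul><li>This is a list item</li></ul>
</div>
"""

collector = TagCollector()
collector.feed(snippet)
print(collector.opened)
print(collector.links)
```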

## 8. CSS
(Cascading Style Sheets)
* Enables the separation of document content from document presentation
* Controls aspects such as layout, colors, and fonts
* "Cascading" refers to how conflicts are resolved: when several rules match the same element, the most specific rule wins


## CSS Syntax

* A CSS rule-set consists of a selector and a declaration block:
* Example:
```
p {
    color: red;
    text-align: center;
}
```

* Learn more about CSS Syntax here: http://www.w3schools.com/css/css_syntax.asp

# Afternoon: Web Scraping using requests and BeautifulSoup

## Web vs Internet

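* The Web (World Wide Web, "www") is not the same thing as the Internet
* The Internet is the network that connects machines; the Web is a collection of linked documents served over it
* Analogy: web sites are islands, and the Internet is the set of bridges connecting them
* HTTP is the language of the Web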

## Types of HTTP requests

* GET (reads data)
* POST (creates or submits data)
* PUT (replaces or updates existing data)
* DELETE (removes data)
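The four request types differ only in the method name and in whether a body is sent. The sketch below builds (but never sends) one request of each type with the standard library, so it runs without network access. The endpoint `https://api.example.com/users` and the helper `build_request` are hypothetical; with the `requests` library used later in these notes the equivalents would be `requests.get`, `requests.post`, `requests.put` and `requests.delete`.

```python
import json
from urllib.request import Request

BASE = 'https://api.example.com/users'  # hypothetical endpoint, never contacted


def build_request(method, url, payload=None):
    """Build an HTTP request object without sending it."""
    body = json.dumps(payload).encode('utf-8') if payload is not None else None
    headers = {'Content-Type': 'application/json'} if body else {}
    return Request(url, data=body, headers=headers, method=method)


reqs = [
    build_request('GET', BASE + '/42'),                             # read one user
    build_request('POST', BASE, {'name': 'Jon'}),                   # create a user
    build_request('PUT', BASE + '/42', {'name': 'Jon', 'age': 45}), # replace a user
    build_request('DELETE', BASE + '/42'),                          # remove a user
]

for req in reqs:
    print(req.get_method(), req.full_url, req.data)
```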

## API
* An API (Application Programming Interface) lets developers communicate with an application according to a specific contract
* A web API is typically defined as a set of HTTP request messages, along with a definition of the structure of the response messages, which are usually in XML (Extensible Markup Language) or JSON (JavaScript Object Notation) format.

* Send queries through the URL
  * Google: geolocations
  * Yelp: restaurants/reviews
  * Zillow: housing info/demographics
  * Socrata: government data
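"Sending a query through the URL" means encoding the parameters into the query string. The sketch below shows that step for the Wikipedia API endpoint used later in these notes, using only the standard library and no network access. The response body is a made-up, heavily trimmed stand-in that only illustrates the JSON parsing step.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

endpoint = 'https://en.wikipedia.org/w/api.php'
params = {'action': 'parse', 'format': 'json', 'page': "Zipf's law"}

# urlencode escapes characters that are not safe in a query string.
query_url = endpoint + '?' + urlencode(params)
print(query_url)

# Decoding the query string recovers the original parameters.
decoded = {key: values[0]
           for key, values in parse_qs(urlsplit(query_url).query).items()}

# Made-up response body, only to show how a JSON reply becomes a dict.
sample_body = json.dumps(
    {'parse': {'title': "Zipf's law", 'links': [{'ns': 0, '*': 'Power law'}]}}
)
response = json.loads(sample_body)
print(response['parse']['title'])
```

Libraries such as `requests` perform the encoding for you when the parameters are passed as a dict.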

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
# import the Requests HTTP library
import requests

# import the Beautiful Soup module 
from bs4 import BeautifulSoup

## Scraping

In [None]:
r = requests.get('https://en.wikipedia.org/wiki/Pi_Day')

In [None]:
soup = BeautifulSoup(r.content, 'html.parser')

In [None]:
print(soup.prettify())

In [None]:
print(soup.title)

In [None]:
for a in soup.find_all('link'):
    print(a['href'])


In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))
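
The `href` values printed above are often relative paths, in-page anchors, or `None` (for `<a>` tags without an `href`). A possible clean-up step, sketched here with the standard library only, resolves them against the page URL. The `hrefs` list is a hypothetical sample, not actual output from the page, and `absolute_links` is a helper name chosen for this example.

```python
from urllib.parse import urljoin

page_url = 'https://en.wikipedia.org/wiki/Pi_Day'

# Hypothetical href values of the kinds a scraped page typically contains.
hrefs = [
    '/wiki/Pi',                       # site-relative path
    '#History',                       # in-page anchor
    '//upload.wikimedia.org/x.png',   # protocol-relative URL
    'https://www.piday.org/',         # already absolute
    None,                             # <a> tag with no href attribute
]


def absolute_links(base, hrefs):
    """Drop empty values and in-page anchors, resolve the rest against base."""
    resolved = []
    for href in hrefs:
        if not href or href.startswith('#'):
            continue
        resolved.append(urljoin(base, href))
    return resolved


links = absolute_links(page_url, hrefs)
print(links)
```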

## In Class Exercise: Use an API to scrape Wikipedia

### Step 1: Get the Data

In [1]:
# import the Requests HTTP library
import requests
import json
import re
# A User-Agent header is required for the Wikipedia API.
headers = {'User-Agent': 'DataWrangling/1.1 (lekha.bhargavi@galvanize.com; dsi example exercise)'}

In [4]:
# Experiment with fetching one or two pages and examining the result (fill in URL and payload)
url = 'https://en.wikipedia.org/w/api.php'

# parameters for the API request
payload = { 'action' : 'parse' , 'format' : 'json','page' : "Zipf's law" }

# make the request
r = requests.post(url, data=payload, headers=headers)

# print out the result of the request as JSON
# print(r.json()['parse'])

{u'templates': [{u'*': u'Template:Probability distribution', u'ns': 10, u'exists': u''}, {u'*': u'Template:Infobox probability distribution', u'ns': 10, u'exists': u''}, {u'*': u'Template:Infobox', u'ns': 10, u'exists': u''}, {u'*': u'Template:IPAc-en', u'ns': 10, u'exists': u''}, {u'*': u'Template:Div col', u'ns': 10, u'exists': u''}, {u'*': u'Template:Column-count', u'ns': 10, u'exists': u''}, {u'*': u'Template:Div col end', u'ns': 10, u'exists': u''}, {u'*': u'Template:Reflist', u'ns': 10, u'exists': u''}, {u'*': u'Template:Cite journal', u'ns': 10, u'exists': u''}, {u'*': u'Template:Citation', u'ns': 10, u'exists': u''}, {u'*': u'Template:Cite conference', u'ns': 10, u'exists': u''}, {u'*': u'Template:Cite book', u'ns': 10, u'exists': u''}, {u'*': u'Template:Cite web', u'ns': 10, u'exists': u''}, {u'*': u'Template:ISSN', u'ns': 10, u'exists': u''}, {u'*': u'Template:Hide in print', u'ns': 10, u'exists': u''}, {u'*': u'Template:Trim', u'ns': 10, u'exists': u''}, {u'*': u'Template:On

### Step 2: Store the Data in MongoDB

In [5]:
# import MongoDB modules
from pymongo import MongoClient
from bson.objectid import ObjectId

# connect to the local MongoDB instance
client = MongoClient('mongodb://localhost:27017/')

# connect to the wikipedia database: if it does not exist it will automatically create it -- one reason why mongoDB can be nice.
db = client.wikipedia

In [6]:
# create a new collection that is unique to me, this is a necessity since we are all sharing a database 
collection = db.sea_feb

In [16]:
type(r.json())

dict

In [7]:
# try storing the document you retrieved earlier in MongoDB (be careful to not store duplicates!)
if not collection.find_one(r.json()['parse']):
    collection.insert_one(r.json()['parse'])
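
An alternative way to avoid duplicates is to let MongoDB do the check in a single call, using an upsert keyed on a stable identifier. The sketch below assumes the parsed API result carries a `pageid` field and that `collection` is a PyMongo collection; `store_once` is a hypothetical helper name, not part of the exercise.

```python
def store_once(collection, parsed):
    """Insert `parsed` unless a document with the same pageid already exists.

    Returns True if a new document was inserted, False otherwise.
    """
    result = collection.update_one(
        {'pageid': parsed['pageid']},   # assumption: the API result has a pageid
        {'$setOnInsert': parsed},       # only written when the upsert inserts
        upsert=True,
    )
    return result.upserted_id is not None


# Usage (not run here): store_once(collection, r.json()['parse'])
```

Compared with `find_one` followed by an insert, this avoids sending the whole document as a query filter and leaves no gap between the check and the write.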

In [8]:
# now see if you can query the database for the article you just stored
zipf = collection.find_one({ "title" : "Zipf's law"})

In [None]:
print(zipf)

### Step 3: Retrieve and store every article (with associated metadata) within 1 hop from the 'Zipf's law' article

*Do not follow external links, only linked Wikipedia articles.*

HINT: The Zipf's Law article should be located at:
'http://en.wikipedia.org/w/api.php?action=parse&format=json&page=Zipf's%20law'

In [10]:
# grab the list of linked Wikipedia articles from the API result 
links = zipf['links']

print(type(links))
links

<type 'list'>


[{u'*': u'Template:Probability distributions', u'exists': u'', u'ns': 10},
 {u'*': u'ARGUS distribution', u'exists': u'', u'ns': 0},
 {u'*': u'Arcsine distribution', u'exists': u'', u'ns': 0},
 {u'*': u'Asymmetric Laplace distribution', u'exists': u'', u'ns': 0},
 {u'*': u'Balding\u2013Nichols model', u'exists': u'', u'ns': 0},
 {u'*': u'Bates distribution', u'exists': u'', u'ns': 0},
 {u'*': u"Benford's law", u'exists': u'', u'ns': 0},
 {u'*': u'Benini distribution', u'exists': u'', u'ns': 0},
 {u'*': u'Benktander type II distribution', u'exists': u'', u'ns': 0},
 {u'*': u'Benktander type I distribution', u'exists': u'', u'ns': 0},
 {u'*': u'Beno\xeet Mandelbrot', u'exists': u'', u'ns': 0},
 {u'*': u'Bernoulli distribution', u'exists': u'', u'ns': 0},
 {u'*': u'Beta-binomial distribution', u'exists': u'', u'ns': 0},
 {u'*': u'Beta distribution', u'exists': u'', u'ns': 0},
 {u'*': u'Beta negative binomial distribution', u'exists': u'', u'ns': 0},
 {u'*': u'Beta prime distribution', u'e

In [11]:
# iterate over each link and store the returned document in MongoDB
for link in links:

    # parameters for API request
    payload = { 'action' : 'parse' , 'format' : 'json','page' : link['*'] }

    r = requests.post(url, data=payload, headers=headers)

    # check to first see if the document is already in our database, if not... store it!
    try:
        parsed = r.json()['parse']
    except (KeyError, ValueError):
        # no 'parse' key (e.g. the page is missing) or the response was not JSON: skip it
        continue
    if not collection.find_one(parsed):
        collection.insert_one(parsed)
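
The `links` list mixes real articles (`'ns': 0`, the main namespace) with other pages such as templates (`'ns': 10`). One possible way to fetch only articles is to filter the list first. The sketch below uses a few entries copied from the output above plus one hypothetical entry without an `'exists'` key; treating a missing `'exists'` key as "page does not exist" is an assumption about the API's output.

```python
links = [
    {'*': 'Template:Probability distributions', 'exists': '', 'ns': 10},
    {'*': 'ARGUS distribution', 'exists': '', 'ns': 0},
    {'*': "Benford's law", 'exists': '', 'ns': 0},
    {'*': 'Some missing page', 'ns': 0},  # hypothetical link to a missing page
]


def article_titles(links):
    """Keep titles in the main namespace that are flagged as existing."""
    return [link['*'] for link in links
            if link['ns'] == 0 and 'exists' in link]


titles = article_titles(links)
print(titles)
```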

### Step 4: Find all articles that mention 'Zipf' or 'Zipfian' (case insensitive)

* Use regular expressions in order to search the content of the articles for the terms Zipf or Zipfian. 
* We only want articles that mention these terms in the displayed text however, so we must first remove all the unnecessary HTML tags and only keep what is in between the relevant tags. 
* Beautiful Soup makes this almost trivial. Explore the documentation to find how to do this effortlessly: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

* Test out your Regular Expressions before you run them over every document you have in your database: http://pythex.org/. Here is some useful documentation on regular expressions in Python: http://docs.python.org/2/howto/regex.html

* Once you have identified the relevant articles, save them to a file for now, we do not need to persist them in the database.
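Before running a pattern over the whole collection, it helps to try it on a few strings. The sketch below shows one possible pattern, assuming whole-word matches are what we want: `\b` marks a word boundary, and `(?:ian)?` makes the "ian" suffix optional. The sample strings are invented for the test.

```python
import re

# Whole-word, case-insensitive match for "Zipf" or "Zipfian".
zipf_re = re.compile(r'\bZipf(?:ian)?\b', re.IGNORECASE)

# Hypothetical snippets standing in for article text.
samples = [
    "Zipf's law is an empirical law.",
    "a ZIPFIAN distribution of word frequencies",
    "opening a gzipfile",
    "Pareto and power laws",
]

matches = [bool(zipf_re.search(text)) for text in samples]
print(matches)
```

The third sample contains the letters "zipf" inside a longer word, which the word boundaries reject.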

In [12]:
# import the Beautiful Soup module 
from bs4 import BeautifulSoup

count = 0

# compile our regular expression since we will use it many times;
# \b anchors the match to whole words: 'Zipf' or 'Zipfian'
regex = re.compile(r'\bZipf(?:ian)?\b', re.IGNORECASE)

# create the output file (overwriting any existing one) and open it in text mode, since json.dump writes str
out = open('zipfian.txt', 'w')

# iterate over every document we have stored
for doc in collection.find():
    # extract the HTML from the document
    html = doc['text']['*']

    # stringify the ID for serialization to our text file
    doc['_id'] = str(doc['_id'])

    # create a Beautiful Soup object from the HTML
    soup = BeautifulSoup(html, 'html.parser')

    # extract all the relevant text of the web page: strips out tags and head/meta content
    text = soup.get_text()

    # perform a regex search with the expression we compiled earlier
    m = regex.search(text)

    # if our search returned an object (it matched the regex), write the document to our output file
    if m:
        count += 1
        json.dump(doc, out) 
        out.write('\n')

# close the opened output file for good measure
out.close()

### Step 5: Augmentation! Time to remix the web... or rather just Wikipedia.

But hey, isn't Wikipedia the web?

We want to augment our Zipfian Wikipedia articles with content from the WWW at large. 
Stepping out of the walled garden of collaboratively edited document safety... let us scrape! 
For each of the articles we found to contain 'Zipf' or 'Zipfian', we want to know what the web has to say. 
For each of the external links of said articles, fetch the linked webpage and extract the `<title>` and `<meta name="keywords">` from the HTML. Beautiful Soup would probably help you a lot here.

You still have to watch out for pages without keywords or a title.

Once you have extracted this information, update the stored document in your database with this information. Add a field called 'extraexternal' that contains the additional contextual information. 'extraexternal' should be an array of JSON objects, each of which has the keys:

* 'url' : the url of the page
* 'title' : the title of the page
* 'keywords' : the keywords from the meta tag

In [13]:
import urllib.parse as urlparse  # Python 3 home of the old urlparse module

In [14]:
# re-open our output file of matched articles 
articles = open('zipfian.txt', 'r')

# iterate over each article that contains 'Zipf' or 'Zipfian'
for line in articles:
    doc = json.loads(line)

    # extract the external links from the Wikipedia article
    links = doc['externallinks']

    # deserialize our document ID into a Mongo ObjectID
    _id = ObjectId(doc['_id'])

    # create an empty 'extraexternal' array to store the results of our web scraping
    collection.update_one( { '_id' : _id }, { '$set' :  { 'extraexternal' : [] } } )

    # iterate over the URLs of the external links
    for url in links:
        # sometimes the URLs are malformed, split the URL into its component parts to fix
        scheme, netloc, path, qs, anchor = urlparse.urlsplit(url)

        # if there is not a scheme specified (what comes before the ://) default to HTTP
        scheme = scheme if scheme else 'http'

        # rejoin the fixed components into a URL string
        fixed_url = urlparse.urlunsplit((scheme, netloc, path, qs, anchor))

        # make the request to grab the content of the external link
        html = requests.get(fixed_url)

        # soupify the HTML so we can traverse/parse it for what we are looking for
        soup = BeautifulSoup(html.text, 'html.parser')

        # extract the title from the page
        title = soup.title

        # extract the keywords from the meta tag
        keywords = soup.find('meta', attrs={'name' : 'keywords'})

        # create an object to store our extra data related to each external link
        augment = { 
                    'url' : url, 
                    'title' : title.string if title else "", 
                    'keywords' : keywords['content'].split(',') if keywords else [] 
                  }

        # update the document we already have stored in MongoDB with our additional information
        collection.update_one({ '_id' : _id }, { '$push' : { 'extraexternal' : augment } } )
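
The URL repair step in the loop above can be pulled out into a small function and checked on its own. This is a sketch using Python 3's `urllib.parse`; the function name `fix_url` and the `example.org` URLs are made up for illustration. It covers protocol-relative URLs (those starting with `//`) and leaves complete URLs untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def fix_url(url, default_scheme='http'):
    """Return url with a scheme, defaulting to HTTP when none is given."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    return urlunsplit((scheme or default_scheme, netloc, path, query, fragment))


print(fix_url('//example.org/paper?id=1'))
print(fix_url('https://example.org/a?b=1#c'))
```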

In [15]:
import pprint as pp

for doc in collection.find({'extraexternal' : { '$exists' :  True } }):
    pp.pprint(doc['title'])
    pp.pprint(doc['extraexternal'])
    print('\n')

u"Zipf's law"
[{u'keywords': [],
  u'title': u'',
  u'url': u'http://aclweb.org/anthology/W98-1218'},
 {u'keywords': [],
  u'title': u'Handbook of Empirical Economics and Finance - Google Books',
  u'url': u'http://books.google.com/books?hl=en&lr=&id=QAUv9R6bJzwC&oi=fnd&pg=PA139'},
 {u'keywords': [],
  u'title': u'',
  u'url': u'http://apachepersonal.miun.se/~mageri/myresearch/bmsb2013-Eriksson.pdf'},
 {u'keywords': [],
  u'title': u'Zipf, Power-law, Pareto - a ranking tutorial',
  u'url': u'http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html'},
 {u'keywords': [u'Computational linguistics',
                u'Probability distribution',
                u'Statistical distributions',
                u'Language',
                u'Monte Carlo method',
                u'Random variables',
                u'Probability density',
                u'Law of large numbers'],
  u'title': u'PLOS ONE: Large-Scale Analysis of Zipf\u2019s Law in English Texts',
  u'url': u'//dx.doi.org/10.13