# Web Scraping and HTML Concepts


## Today's Objectives: 
### Morning:
* *Walk through* a typical web data pipeline
* *Compare and Contrast* SQL and noSQL
* *Perform* basic operations using Mongo and PyMongo
* *Explain* the basic concepts of HTML

### Afternoon:
* *Learn how to* write code to pull elements from a web page
* *Use* an existing API to fetch data and parse using BeautifulSoup


## 1. Resources

* [Precourse-Web Awareness](https://github.com/zipfian/precourse/tree/master/Chapter_8_Web_Awareness)
* [The Little MongoDB Book](http://openmymind.net/mongodb.pdf)
* [w3 schools](http://www.w3schools.com/)
* [PyMongo tutorial](http://api.mongodb.org/python/current/tutorial.html)
* [BeautifulSoup Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Scrape anonymously with Tor](https://deshmukhsuraj.wordpress.com/2015/03/08/anonymous-web-scraping-using-python-and-tor/)



## 2. Installing Mongo and PyMongo

### Mongo
1. Install MongoDB: [https://docs.mongodb.org/manual/installation/](https://docs.mongodb.org/manual/installation/)

2. Much like Postgres, you will need to launch the server before using Mongo. 

   - If the directory `/data/db` does not exist, create it by: `mkdir -p /data/db`
   - To start the Mongo server: `sudo mongod`
   - The server usually runs on `localhost:27017`
   - Open a new terminal and type `mongo` to open up a Mongo shell (recent MongoDB releases ship the shell as `mongosh` instead)
   - Type `show dbs;` to show the databases you have
   - You can exit by typing `exit`


### PyMongo
1. Install PyMongo: [http://api.mongodb.org/python/current/installation.html](http://api.mongodb.org/python/current/installation.html)

## 3. SQL vs NoSQL

* Despite the name, NoSQL does not mean "no SQL"; it is usually read as "Not Only SQL".
* It is a different paradigm for dealing with messy data that does not lend itself to an RDBMS.
* A NoSQL stack may still include an RDBMS component, alongside tools such as Redis to handle queuing and Hadoop for big-data processing.
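
To make the contrast concrete, here is a minimal sketch that stores the same record both ways. The table and field names (`users`, `friends`) are made up for illustration, an in-memory SQLite database stands in for the RDBMS, and a plain Python dict stands in for a document, so no database server is needed.

```python
import json
import sqlite3

# Relational style: a fixed schema, with the friends list split into its own table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("CREATE TABLE friends (user_id INTEGER, friend TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Jon', 45)")
conn.executemany("INSERT INTO friends VALUES (?, ?)", [(1, "Henry"), (1, "Ashley")])

# Reassembling the full record takes a join across the two tables.
rows = conn.execute(
    "SELECT u.name, f.friend FROM users u JOIN friends f ON f.user_id = u.id "
    "ORDER BY f.friend"
).fetchall()
print(rows)  # [('Jon', 'Ashley'), ('Jon', 'Henry')]

# Document style: the whole record, nested list included, is one JSON-like object.
document = {"name": "Jon", "age": 45, "friends": ["Henry", "Ashley"]}
print(json.dumps(document))
```

The relational version needs two tables and a join to represent a list, while the document version keeps the list inside the record itself.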


## 4. Typical Pipeline
<img src="pipeline.png" width = 500>

## 5. MongoDB Concepts

* MongoDB is a document-oriented database, an alternative to RDBMS
* Used for storing semi-structured data
* JSON-like objects form the data model, rather than RDBMS tables
* No enforced schema; traditionally no joins and no multi-document transactions (newer MongoDB versions add limited support for both)
* Sub-optimal for complicated queries

* MongoDB is made up of databases which contain collections (tables)
* A collection is made up of documents (rows or records)
* Each document is made up of fields (columns)

* An RDBMS defines columns at the table level; a document-oriented database defines fields at the document level, so two documents in the same collection can have different fields.

* CURSOR: When you ask MongoDB for data, it returns a cursor, which is a pointer to the result set. You can count, skip ahead, or otherwise refine the cursor before any data is pulled down; actual execution is delayed until the results are needed.
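
The deferred execution described above resembles a Python generator. The sketch below is only an analogy, not how MongoDB or PyMongo implement cursors; `fake_cursor` and `fetched` are hypothetical names used to show when work actually happens.

```python
from itertools import islice

fetched = []  # records which documents were actually "pulled down"

def fake_cursor(documents):
    """Yield documents one at a time, noting each fetch as it happens."""
    for doc in documents:
        fetched.append(doc["name"])
        yield doc

docs = [{"name": "Jon"}, {"name": "Ashley"}, {"name": "Frank"}]

cursor = fake_cursor(docs)   # creating the cursor fetches nothing
print(fetched)               # []

# Only when we ask for results does any fetching occur, and only as much as needed.
first_two = list(islice(cursor, 2))
print(first_two)             # [{'name': 'Jon'}, {'name': 'Ashley'}]
print(fetched)               # ['Jon', 'Ashley']
```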



### Mongo Clients
<img src="client-server.png" width = 500>


## 6. Create a Database and Do Some Operations

* Mongo can create databases, collections, documents, etc. on the fly. 
* To create a new database, simply start using a database you haven't created yet: `use my_new_database`

## Inserting Data
```
db.users.insert({ name: 'Jon', age: '45', friends: [ 'Henry', 'Ashley']})

show dbs
db.getCollectionNames()

db.users.insert({ name: 'Ashley', age: '37', friends: [ 'Jon', 'Henry']})
db.users.insert({ name: 'Frank', age: '17', friends: [ 'Billy'], car : 'Civic'})

db.users.find()
```
* Mongo adds a unique `_id` field to each document by default

## Querying Data
```
// find by single field
db.users.find({ name: 'Jon'})

// find by presence of field
db.users.find({ car: { $exists : true } })

// find by value in array
db.users.find({ friends: 'Henry' })

// field selection (only return name)
db.users.find({}, { name: true })
```
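
The same four shell queries translate almost directly to PyMongo, where filters and projections are ordinary Python dicts. This is a sketch under assumptions: `run_queries` is a hypothetical helper, and it expects to be handed a PyMongo collection (for example `MongoClient()['my_new_database']['users']`) with a server running.

```python
# Filters and projections are plain dicts; operator names such as $exists are quoted.
BY_NAME = {"name": "Jon"}                 # find by single field
HAS_CAR = {"car": {"$exists": True}}      # find by presence of field
FRIEND_OF_HENRY = {"friends": "Henry"}    # find by value in array
NAME_ONLY = {"name": True}                # projection: return only name (plus _id)

def run_queries(users):
    """Run the four queries against a collection and return the results as lists."""
    return {
        "by_name": list(users.find(BY_NAME)),
        "has_car": list(users.find(HAS_CAR)),
        "friend_of_henry": list(users.find(FRIEND_OF_HENRY)),
        "names": list(users.find({}, NAME_ONLY)),
    }
```

Wrapping each `find` call in `list(...)` forces the cursor to fetch its results, which is convenient for small collections but should be avoided for large ones.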


# MongoDB Example

## PyMongo


In [None]:
# import MongoDB modules
from pymongo import MongoClient

In [None]:
# connect to the MongoDB instance running locally on the default port
client = MongoClient('mongodb://localhost:27017/')

In [None]:
db = client.lb_test

In [None]:
users = db.lb_users

In [None]:
users.insert_one({'name':'lekha', 'city':'seattle'})

In [None]:
users.insert_one({'name':'joe', 'city':'new york' })

In [None]:
# count all documents in the collection (Cursor.count() was removed in PyMongo 4)
users.count_documents({})

In [None]:
users.find_one()

In [None]:
t = users.find_one({'name': 'lekha'})
t

In [None]:
# find_one() only reads, so the count is unchanged
users.count_documents({})

## 7. HTML Concepts

* HyperText Markup Language
* A markup language that forms the building blocks of all websites
* Consists of tags enclosed in angle brackets (like `<html>`)

### Important Tags

```html
<div>Defines a division or section</div>
<a href="http://www.w3schools.com">Link to W3Schools.com!</a>
<table>Will contain a table</table>
<p>This is a paragraph</p>
<h1>This is a header!</h1>
<ul>
    <li>This is a list</li>
</ul>
```
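
To see how a program views these tags, the following sketch feeds a small snippet to Python's built-in `html.parser` module and records each opening tag with its attributes. `TagCollector` and `snippet` are names made up for this example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record every opening tag and its attributes, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

snippet = """
<div>
  <h1>This is a header!</h1>
  <p>This is a paragraph</p>
  <a href="http://www.w3schools.com">Link to W3Schools.com!</a>
  <ul><li>This is a list item</li></ul>
</div>
"""

collector = TagCollector()
collector.feed(snippet)
for tag, attrs in collector.tags:
    print(tag, attrs)
```

Only the `<a>` tag carries an attribute here, so it is the only entry with a non-empty dict.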

## 8. CSS
(Cascading Style Sheets)
* Enables the separation of document content from document presentation
* Controls aspects such as layout, colors, and fonts
* "Cascading" refers to how conflicting rules are resolved: when several rules match an element, the most specific one is chosen


## CSS Syntax

* A CSS rule-set consists of a selector and a declaration block:
* Example:
```
p {
    color: red;
    text-align: center;
}
```

* Examples of CSS selectors: element (`p`), id (`#some-id`), class (`.some-class`)
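
The same selector syntax is what you use to pull elements out of a page when scraping. Below is a small sketch using BeautifulSoup's `select` method on an inline HTML string; the `intro` id and `note` class are invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<p id="intro">Welcome</p>
<p class="note">First note</p>
<p class="note">Second note</p>
"""

soup = BeautifulSoup(html, "html.parser")

by_element = soup.select("p")      # element selector: every <p>
by_id = soup.select("#intro")      # id selector: the element with id="intro"
by_class = soup.select(".note")    # class selector: elements with class="note"

print(len(by_element))                     # 3
print(by_id[0].get_text())                 # Welcome
print([p.get_text() for p in by_class])    # ['First note', 'Second note']
```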

# Afternoon: Web Scraping using requests and BeautifulSoup

## Web vs Internet

* The Web is the World Wide Web (WWW)
* It is not the same thing as the Internet
* Think of the Web as a collection of islands and the Internet as the bridges connecting them
* HTTP is the language of the Web

## Types of HTTP requests

* GET (retrieves data)
* POST (creates new data)
* PUT (updates or replaces existing data)
* DELETE (removes data)
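
As a rough illustration, the sketch below builds one request per method with the standard library's `urllib.request.Request` and inspects it without sending anything. The `https://api.example.com/users` endpoint and the JSON bodies are placeholders, not a real API.

```python
from urllib.request import Request

BASE = "https://api.example.com/users"  # placeholder endpoint, not a real API

requests_to_send = [
    Request(BASE + "/1", method="GET"),                                # read a user
    Request(BASE, data=b'{"name": "Jon"}', method="POST"),             # create a user
    Request(BASE + "/1", data=b'{"name": "Jonathan"}', method="PUT"),  # replace a user
    Request(BASE + "/1", method="DELETE"),                             # delete a user
]

for req in requests_to_send:
    # Nothing goes over the network; we only inspect the prepared request.
    print(req.get_method(), req.full_url)
```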

## API
* An API (Application Programming Interface) is a way for developers to communicate with an application according to a specific contract
* An API is typically defined as a set of Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, which is usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.

* Send queries through the URL
  * Google: geolocations
  * Yelp: restaurants/reviews
  * Zillow: housing info/demographics
  * Socrata: government data
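
Sending a query through the URL means encoding key/value pairs into the query string. Here is a minimal sketch using only the standard library; the endpoint and parameter names are hypothetical and do not correspond to any of the services listed above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

endpoint = "https://api.example.com/restaurants"   # hypothetical endpoint
params = {"city": "Seattle", "term": "thai food", "limit": 5}

# urlencode escapes each value and joins the pairs with "&"
url = endpoint + "?" + urlencode(params)
print(url)
# https://api.example.com/restaurants?city=Seattle&term=thai+food&limit=5

# The server sees the same key/value pairs after decoding the query string.
print(parse_qs(urlsplit(url).query))
```

With the `requests` library, passing the same dict as `requests.get(endpoint, params=params)` builds the query string for you.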

In [None]:
# import the Requests HTTP library
import requests

# import the Beautiful Soup module 
from bs4 import BeautifulSoup

## Scraping

In [None]:
r = requests.get('https://en.wikipedia.org/wiki/Diurnal_cycle')

In [None]:
# name the parser explicitly so results are consistent across machines
soup = BeautifulSoup(r.content, 'html.parser')

In [None]:
print(soup.prettify())

In [None]:
print(soup.title)

In [None]:
# <link> tags point to stylesheets and other resources; use 'a' to get hyperlinks
for link in soup.find_all('link'):
    print(link.get('href'))
