# Web Scraping
<br>
## MongoDB
<br>
### Cary Goltermann

# Objectives

1. Understand motivation for web scraping:
  * What does a web data pipeline look like?
  * How should we store data from the web?
2. Know high level differences between NoSQL and SQL.
3. Perform basic operations using Mongo Shell.

<center><h2>The Reality of Scraping</h2></center><br>
<center><img src="images/scraping_meme.png" style="width: 800px"></center>

# Why do we scrape the web?

* Realistically, data that you want to study won't always be available to you in the form of a curated data set.
* Need to go to the internets to find interesting data:
    * From an existing company
    * Text for NLP
    * Images

<center><h2>Web Data Pipeline</h2></center><br>
<center><img src="images/web_data_pipeline.png" style="width: 800px"></center>

# Storing Data from the Web

* We already know how to store data -> SQL (RDBMS).

### Question:  Individual - 1 min
* Why wouldn't SQL necessarily be the best tool for storing data that we retrieve from the web?

### Data are MESSY! Especially when you find them online!
* Why? A site might:
    * have changed the way they reference data internally
    * have different conventions team-to-team, or even developer-to-developer
    * have deprecated old versions of sites but still have legacy code maintaining them

# Why Not SQL?

### Question: Pair - 2 mins
* Given these aspects of online data, why might using SQL be hard for data collection?

### SCHEMAS
* You need a schema, a priori, when setting up a relational database.
* This might be difficult or even impossible to have without looking at all of the data you want to scrape.
* Catch-22: you need all your data to figure out the schema, but you need the schema so you can make a database to store your data.
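
To make the catch-22 concrete, here is a minimal sketch in plain JavaScript (no database needed). The records and field names are made up for illustration, and `inferFields` is a hypothetical helper. The point is that the full list of fields only becomes known after every record has been seen.

```javascript
// Hypothetical scraped records: each one has a slightly different shape.
const scraped = [
    { name: 'Jon', age: 45 },
    { name: 'Ashley', age: 37, city: 'Denver' },
    { title: 'Frank', car: 'Civic' }
];

// Collect the union of field names across all records.
// A relational schema would need to know this list up front.
function inferFields(records) {
    const fields = new Set();
    for (const record of records) {
        for (const key of Object.keys(record)) {
            fields.add(key);
        }
    }
    return [...fields].sort();
}

console.log(inferFields(scraped));
```

Each new record can add a field that no earlier record had, so a schema written after the first few records may already be out of date.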

# NoSQL

* Stands for **N**ot **o**nly **SQL**. MongoDB is a flavor of NoSQL, like PostgreSQL is a flavor of SQL.
    * A NoSQL paradigm may be preferable to SQL because it is schemaless.
    * Great for storing unstructured data, as we may find on the web!
    * MongoDB is a document-oriented DBMS.

<center><h2>Centered around "Documents"</h2></center><br>
<center><img src="images/document_based_storage.png" style="width: 800px"></center>

<center><h1>Different Usage Methodologies,</h1></center>
<center><h1>Different Databases</h1></center> 

# SQL Methodology

Want to prevent redundancy in data by having tables with unique information and relations between them (normalized data).

* Creates a framework for querying with joins.
* Makes it easier to update database. Only ever have to change information in a single place.
* This can result in "simple" queries being slower, but more complex queries are often faster.

# Mongo Methodology

Document based storage system. Does not enforce normalized data. Can have data redundancies in documents (denormalized data).
* No joins in the relational sense (the aggregation framework's `$lookup` stage is the closest equivalent).
* A change to database generally results in needing to change many documents.
* Since there is redundancy in the documents, simple queries are generally faster. But complex queries are often slower.
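
The trade-off between the two methodologies can be sketched in plain JavaScript, with arrays standing in for tables and collections. All names and data here are hypothetical; this only illustrates normalized versus denormalized storage, not how either database works internally.

```javascript
// Normalized (SQL-style): each fact lives in exactly one place.
const users = [
    { id: 1, name: 'Jon', cityId: 10 },
    { id: 2, name: 'Ashley', cityId: 10 }
];
const cities = [{ id: 10, name: 'Denver' }];

// Reading a user's city requires a join-like lookup across "tables".
function cityOf(userName) {
    const user = users.find(u => u.name === userName);
    return cities.find(c => c.id === user.cityId).name;
}

// Denormalized (document-style): the city is repeated in every document.
const userDocs = [
    { name: 'Jon', city: { name: 'Denver' } },
    { name: 'Ashley', city: { name: 'Denver' } }
];

// Renaming the city is a single change when the data is normalized...
cities[0].name = 'Denver, CO';

// ...but every document that embeds it must change when denormalized.
let docsTouched = 0;
for (const doc of userDocs) {
    if (doc.city.name === 'Denver') {
        doc.city.name = 'Denver, CO';
        docsTouched += 1;
    }
}

console.log(cityOf('Jon'));          // read goes through two arrays
console.log(userDocs[0].city.name);  // direct read, no lookup
console.log(docsTouched);            // number of documents rewritten
```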

<center><h2>SQL vs. Mongo</h2></center><br>
<center><img src="images/sql_vs_mongo_table.png" style="width: 900px"></center>

## Connecting to Mongo

In practice, there are two main ways that you will be connecting with Mongo:
* From Python
* From the console - shell
    
<center><img src="images/mongo_clients.png" style="width: 500px"></center>

## Mongo Clients

Both the Mongo shell and PyMongo require a Mongo server to be running. In practice this means starting a Mongo daemon process. To do this, execute the command `mongod` at the terminal.

* Note: The Mongo daemon will occupy the terminal that you started it in for the life of the server session. Read: run `mongod` in a separate terminal tab (or tmux).

Now we're going to explore interacting with the Mongo shell. This is not the main way that you'll interact with Mongo while scraping; we have Python for that. But it's good to know how to issue queries from the shell.

* Check slack for command to connect to remote Mongo server.

## Mongo Shell Demo Code

#### Using Mongo - General Commands for Inspecting Mongo

```javascript
help                        // List top level mongo commands

db.help()                   // List database level mongo commands

db.<collection name>.help() // List collection level mongo commands.

show dbs                    // Get list of databases on your system

use <database name>         // Change current database

show collections            // List collections in current database
```

## Inserting

Once you're using a database you refer to it with the name **db**. Collections within databases are accessible through dot notation.

```javascript
db.users.insert({ name: 'Jon', age: '45', friends: [ 'Henry', 'Ashley']})

db.getCollectionNames()  // Another way to get collection list

db.users.insert({ name: 'Ashley', age: '37', friends: [ 'Jon', 'Henry']})
db.users.insert({ name: 'Frank', age: '17',
                  friends: [ 'Billy'], car : 'Civic'})

db.users.find()
```
* Note: The three documents that we inserted into the `users` collection above didn't all have the same fields.
* Note: Mongo creates an `_id` field for each document if one isn't provided.

## Querying

```javascript
db.users.find({ name: 'Jon'})               // find by single field

db.users.find({ car: { $exists : true } })  // find by presence of field

db.users.find({ friends: 'Henry' })         // find by value in array

db.users.find({}, { name: true })   // field selection (name, plus _id by default)
```
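
As a rough mental model of how the query documents above select documents, here is a plain-JavaScript sketch that runs without a Mongo server. The `matches` and `find` functions are hypothetical and cover only the three query shapes shown. They are not how MongoDB is implemented.

```javascript
const users = [
    { name: 'Jon', age: '45', friends: ['Henry', 'Ashley'] },
    { name: 'Ashley', age: '37', friends: ['Jon', 'Henry'] },
    { name: 'Frank', age: '17', friends: ['Billy'], car: 'Civic' }
];

// Hypothetical matcher: supports equality, array membership, and $exists.
function matches(doc, query) {
    return Object.entries(query).every(([field, condition]) => {
        const value = doc[field];
        const isOperator = typeof condition === 'object' && condition !== null;
        if (isOperator && '$exists' in condition) {
            return (field in doc) === condition.$exists;
        }
        if (Array.isArray(value)) {
            return value.includes(condition);  // value-in-array match
        }
        return value === condition;            // plain equality
    });
}

const find = query => users.filter(doc => matches(doc, query));

console.log(find({ name: 'Jon' }).map(d => d.name));
console.log(find({ car: { $exists: true } }).map(d => d.name));
console.log(find({ friends: 'Henry' }).map(d => d.name));
```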

A quick way to figure out how to write a Mongo query is to think about how you would do it in SQL and check out a resource like this Mongo endorsed [conversion guide](https://docs.mongodb.com/manual/reference/sql-comparison/#create-and-alter), or use something like a [query translator](http://www.querymongo.com/).

## Updating

```javascript
// replaces friends array
db.users.update({name: "Jon"}, { $set: {friends: ["Phil"]}})

// adds to friends array
db.users.update({name: "Jon"}, { $push: {friends: "Susie"}})

// upsert: third argument `true` inserts a new document if none matches
db.users.update({name: "Stevie"}, { $push: {friends: "Nicks"}}, true)

// multiple updates: fourth argument `true` applies the change to every
// matching document instead of only the first
db.users.update({}, { $set: { activated : false } }, false, true)
```
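
To see what `$set`, `$push`, and an upsert do to the data, the first three updates can be mimicked in plain JavaScript. The `update` helper below is hypothetical and handles only simple equality queries on a single document. It is a sketch of the semantics, not Mongo's implementation (for example, it does not create an `_id`).

```javascript
const users = [{ name: 'Jon', friends: ['Henry', 'Ashley'] }];

// Hypothetical helper: apply $set / $push to the first document whose
// fields equal those in `query`; with `upsert`, create one if none match.
function update(collection, query, change, upsert = false) {
    let doc = collection.find(d =>
        Object.keys(query).every(key => d[key] === query[key]));
    if (!doc) {
        if (!upsert) return;
        doc = { ...query };  // new document seeded from the query
        collection.push(doc);
    }
    for (const [field, value] of Object.entries(change.$set || {})) {
        doc[field] = value;  // $set replaces the field outright
    }
    for (const [field, value] of Object.entries(change.$push || {})) {
        doc[field] = (doc[field] || []).concat([value]);  // $push appends
    }
}

update(users, { name: 'Jon' }, { $set: { friends: ['Phil'] } });
update(users, { name: 'Jon' }, { $push: { friends: 'Susie' } });
update(users, { name: 'Stevie' }, { $push: { friends: 'Nicks' } }, true);

console.log(JSON.stringify(users));
```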

## Importing

To import existing data into a Mongo database, use `mongoimport` at the command line. It accepts several file formats: JSON, CSV, and TSV.

```
mongoimport --db tweets --collection coffee --file coffee-tweets.json
```

Now that we have some larger data we can see that returns from queries are not always so small.

```javascript
use tweets
db.coffee.find()
```

## Cursor

The shell displays only the first 20 documents returned by a query; after that you will need to type `it` to get more. What `find` returns is actually a cursor: an object that has many methods implemented on it and that supports the command `it` to iterate through more of the returned items.

```javascript
db.coffee.find().count()      // 122

db.coffee.find().limit(2)     // Only two documents

// Top three users by followers count
db.coffee.find().sort({ 'user.followers_count' : -1}).limit(3)
```
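
The batching behavior of a cursor can be imitated with a JavaScript generator. This is a sketch under stated assumptions: `cursor`, `nextBatch`, and the 45 made-up documents are hypothetical, and the batch size of 20 simply mirrors what the shell displays.

```javascript
// Hypothetical stand-in for a collection of 45 documents.
const docs = Array.from({ length: 45 }, (_, i) => ({ n: i }));

// A generator yields documents lazily, one at a time, like a cursor.
function* cursor(collection) {
    for (const doc of collection) {
        yield doc;
    }
}

// Pull at most `size` documents from the cursor, like typing `it`.
function nextBatch(cur, size = 20) {
    const batch = [];
    while (batch.length < size) {
        const { value, done } = cur.next();
        if (done) break;
        batch.push(value);
    }
    return batch;
}

const cur = cursor(docs);
console.log(nextBatch(cur).length);  // first batch
console.log(nextBatch(cur).length);  // second batch
console.log(nextBatch(cur).length);  // whatever is left
```

Nothing is read from the collection until a batch is requested, which is why a query over a large collection returns immediately.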

## Iteration

MongoDB also has a flexible shell/driver. This allows you to take some action based on a query, or to update documents. You can use an iterator on the cursor to go document by document. In the JavaScript shell we can do this with JavaScript's `forEach`. `forEach` is similar to Python's iteration with the `for` loop; however, JavaScript takes a more functional approach to this type of iteration and requires that you pass a callback (a function), similar to `map` and `reduce`.

```javascript
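// For every tweet, record its user under each URL the tweet mentions.
db.coffee.find().forEach(function(doc) {
    doc.entities.urls.forEach(function(url) {
        // upsert (third argument `true`): create the url document if missing
        db.urls.update({ 'url': url }, { $push: { 'user': doc.user } },
                       true)
    });
});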
```

## Aggregation

Aggregations in Mongo end up being way less pretty than in SQL/Pandas. Let's just bite the bullet and take a look:

```javascript
db.coffee.aggregate( [ { $group :
    {
        _id: "$filter_level",
        count: { $sum: 1 }
    }
}])
```

Here we are first declaring that we're going to do some sort of grouping operation. Then, as Mongo desires everything to have an `_id` field, we specify that the `_id` is going to be the filter level. And then we're going to perform a sum over each level counting 1 for each observation. This information is going to be stored in a field called `count`. What do we get back?
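
The shape of the result can be previewed without the dataset by writing the same grouping in plain JavaScript. The three tweets and their `filter_level` values are made up, and `countBy` is a hypothetical helper that mirrors `{ $group: { _id: "$filter_level", count: { $sum: 1 } } }`.

```javascript
// Hypothetical tweets; only the field being grouped on matters here.
const tweets = [
    { filter_level: 'low' },
    { filter_level: 'medium' },
    { filter_level: 'low' }
];

// One output document per distinct value, with a running count of 1s.
function countBy(docs, field) {
    const groups = new Map();
    for (const doc of docs) {
        const key = doc[field];
        groups.set(key, (groups.get(key) || 0) + 1);
    }
    return [...groups].map(([_id, count]) => ({ _id, count }));
}

console.log(countBy(tweets, 'filter_level'));
```

Each distinct value of the grouping field becomes the `_id` of one output document, and `count` holds how many input documents fell into that group.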

## Aggregation Cont.

We can do more complicated things as well. Here's a query that returns the average number of friends for users in this dataset, by country. We need to access the country code field nested inside the place field, which is easy with dot notation.

```javascript
db.coffee.aggregate( [ { $group :
    {
        _id: "$place.country_code",
        averageFriendCount: { $avg: "$user.friends_count" }
    }
}])
```
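
A grouped average over nested fields (what Mongo's `$avg` accumulator computes) can be sketched in plain JavaScript. The tweets below are made up, and `getPath` and `averageBy` are hypothetical helpers. Only the dotted paths `place.country_code` and `user.friends_count` come from the query above.

```javascript
// Hypothetical tweets with the nested fields used in the query.
const tweets = [
    { place: { country_code: 'US' }, user: { friends_count: 100 } },
    { place: { country_code: 'US' }, user: { friends_count: 300 } },
    { place: { country_code: 'GB' }, user: { friends_count: 50 } }
];

// Resolve a dotted path such as 'place.country_code' on a document.
const getPath = (doc, path) =>
    path.split('.').reduce(
        (obj, key) => (obj == null ? undefined : obj[key]), doc);

// Group by one path and average the values found at another.
function averageBy(docs, groupPath, valuePath) {
    const totals = new Map();
    for (const doc of docs) {
        const key = getPath(doc, groupPath);
        const entry = totals.get(key) || { sum: 0, n: 0 };
        entry.sum += getPath(doc, valuePath);
        entry.n += 1;
        totals.set(key, entry);
    }
    return [...totals].map(([_id, { sum, n }]) => ({
        _id,
        averageFriendCount: sum / n
    }));
}

console.log(averageBy(tweets, 'place.country_code', 'user.friends_count'));
```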

For a guide on how to convert from an SQL style aggregation to a Mongo style aggregation, check out this [aggregation conversion guide](https://docs.mongodb.com/manual/reference/sql-aggregation-comparison).

# Objectives

1. Understand motivation for web scraping:
  * What does a web data pipeline look like?
  * How should we store data from the web?
2. Know high level differences between NoSQL and SQL.
3. Perform basic operations using Mongo Shell.