# Web Scraping and HTML Concepts


### Morning Objectives:
* *Understand* the basic structure of a web site (HTML & CSS), and how data is exchanged (JSON).
* *Learn* what HTTP requests are and how to execute them in Python
* *Practice* how to write code to pull elements from a web page using BeautifulSoup.
* *Use* an existing API to fetch data and parse using BeautifulSoup.

### Afternoon Objectives:
* *Install* `mongo` and `pymongo`.
* *Compare and Contrast* SQL and NoSQL.
* *Perform* basic operations using Mongo.
* *Translate* smoothly between JavaScript-based CLI commands and pymongo.

## Resources

* [Precourse-Web Awareness](https://github.com/zipfian/precourse/tree/master/Chapter_8_Web_Awareness)
* [The Little MongoDB Book](http://openmymind.net/mongodb.pdf)
* [w3 schools](http://www.w3schools.com/) : HTML tags and their attributes.
* [PyMongo tutorial](http://api.mongodb.org/python/current/tutorial.html)
* [BeautifulSoup Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Scrape anonymously with Tor](https://deshmukhsuraj.wordpress.com/2015/03/08/anonymous-web-scraping-using-python-and-tor/)

## Web Concepts

#### Web vs Internet

The web (the World Wide Web, or WWW) is different from the Internet.  The web is a collection of *content*, and the Internet is the *infrastructure for accessing and distributing* this content.

HTTP is the language of the web.  Data on the web travels in HTTP messages, which can carry HTML or other types of data.

#### **H**yper**T**ext **M**arkup **L**anguage

HTML is a *markup language* (think Markdown) that forms the building blocks of all websites.  It codifies the content and structure of a web page.

An HTML document consists of tags enclosed in angle brackets (like `<html>`).

A minimal HTML document:

```html
<html lang="en">
  <head>
    <title>My Web Site</title>
    <link rel="stylesheet" type="text/css" href="style.css">
    <script type="text/javascript" src="script.js"></script>
  </head>
  <body>
    This is my web site
  </body>
</html>
```

The `<link>` and `<script>` tags are not strictly necessary, but will appear in more or less every HTML document.

* The `<link>` tag points to a **stylesheet**, which controls how the document is displayed in the browser.
* The `<script>` tag points to a **JavaScript** program.  This allows programmers to add *dynamic behavior* to an HTML document.
* The `<body>` tag contains the actual contents of your site which are displayed to the user.

### The Language of the Web

Communication over the web is governed by the [Hyper-Text Transfer Protocol (HTTP)](https://www.w3.org/Protocols/rfc2616/rfc2616.html).  A client (such as a web browser acting on your behalf) makes a request to a server, and then receives a response.

###### An Example HTTP Request

![HTTP Request](img/httprequest.jpg)

###### An Example HTTP Response

![HTTP Response](img/httpresponse.jpg)

#### Requests:  

There are several different types of requests, but the most common ones are GET and POST [(see here for more)](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).

* GET asks the server to send you some resource (usually a web page).
* POST sends the server some information (like a username & password) in the body of the request, and the response depends on what was sent.

The main practical difference between the two is that a POST request carries a body, and the server will usually reject the request if that body does not have the format it expects.  The expected format is defined by the API (Application Programming Interface) being used.
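As a rough sketch of the difference, the standard library's `urllib.request.Request` can build both kinds of request without sending anything over the network.  The URLs and form fields below are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# A GET request: just a URL, no body.
get_request = Request("http://example.com/profile")

# A POST request: the same kind of URL, plus a body carrying form data.
# (Hypothetical field names; a real API documents the fields it expects.)
form = urlencode({"username": "alice", "password": "secret"}).encode("ascii")
post_request = Request("http://example.com/login", data=form, method="POST")

print(get_request.get_method(), get_request.data)
print(post_request.get_method(), post_request.data)
```

Nothing is transmitted here; the objects only describe what would be sent.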

#### Responses:  

You may have noticed the first line of the response contains a status code.  There are a few status codes you should be familiar with [(see here for more)](https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html):

* 200 - Everything is good
* 400 - Your POST request body was probably missing / not formatted correctly
* 401 - You need to authenticate for this request to be allowed
* 404 - The server couldn't find what you were looking for
* 405 - You used the wrong HTTP request method (GET/POST/etc.)
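If you want to look these codes up from Python, the standard library's `http.HTTPStatus` enum knows the official reason phrase for each one.  The `describe` helper below is a hypothetical convenience function written for this example, not part of any library.

```python
from http import HTTPStatus

def describe(code):
    """Return the reason phrase and a coarse category for a status code."""
    status = HTTPStatus(code)
    if 200 <= code < 300:
        category = "success"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return "{} {} ({})".format(code, status.phrase, category)

summaries = [describe(code) for code in (200, 400, 401, 404, 405)]
for line in summaries:
    print(line)
```

As a rule of thumb, codes in the 400s mean the request was at fault and codes in the 500s mean the server was.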

### HTTP In Action

Let's visit a website!  How about the site I made for my capstone, http://www.ambient-insights.live

You can also visit it from the command line:

In [19]:
!curl http://www.ambient-insights.live

<!DOCTYPE html>
<html lang="en">

  <head>
    <!-- Global Site Tag (gtag.js) - Google Analytics -->
    <script async src="https://www.googletagmanager.com/gtag/js?id=UA-106760358-1"></script>
    <script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments)};
      gtag('js', new Date());

      gtag('config', 'UA-106760358-1');
    </script>

<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-N5QWFGW');</script>
<!-- End Google Tag Manager -->

    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <meta name="description" content="Landing page for Ambient Dashboa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 10061  100 10061    0     0  10061      0  0:00:01 --:--:--  0:00:01  209k


In [20]:
# Now in code:
import requests
from IPython.display import HTML
response = requests.get("http://www.ambient-insights.live")
response.status_code

200

In [21]:
response.text
#HTML(response.text)

'<!DOCTYPE html>\n<html lang="en">\n\n  <head>\n    <!-- Global Site Tag (gtag.js) - Google Analytics -->\n    <script async src="https://www.googletagmanager.com/gtag/js?id=UA-106760358-1"></script>\n    <script>\n      window.dataLayer = window.dataLayer || [];\n      function gtag(){dataLayer.push(arguments)};\n      gtag(\'js\', new Date());\n\n      gtag(\'config\', \'UA-106760358-1\');\n    </script>\n\n<!-- Google Tag Manager -->\n<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\nnew Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\nj=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\n\'https://www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);\n})(window,document,\'script\',\'dataLayer\',\'GTM-N5QWFGW\');</script>\n<!-- End Google Tag Manager -->\n\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\n    <meta name="descr

### The Structure of the Web

Okay, great; so we can send HTTP requests pretty easily in Python.  But how do we make sense out of the mess we get back?  First, we'll need to have a better understanding of what makes up a website!

Websites are generally made up of three parts:

* Hyper-Text Markup Language (HTML) contains the page's content
* Cascading Style Sheets (CSS) tells the browser how to display the content
* JavaScript (JS) executes in the browser and enables interactivity and many other functionalities

You will be looking at HTML extensively for your web scraping efforts, so we'll be focusing on it today.

### HTML

###### Basic Tags:

* `<h1>` through `<h6>`, `<p>` - Heading or paragraph text
* `<ul>, <ol>, <li>` - Unordered list (bullets), ordered list (numbers), item in the list
* `<img>` - Image
* `<input>, <button>, <textarea>` - Elements which take user input [(see more)](https://www.w3schools.com/TagS/att_input_type.asp)
* `<div>, <span>` - Denote subsections of the HTML
* `<a>` - Generates a clickable link

###### Basic Attributes:

* `src` - Gives the source of the image or script
* `href` - The target of a link
* `type` - The type of an input element.  Many types are available; see above link for more info
* `class` - Gives one or more classes to an element.  Usually used with CSS selectors to style content
* `id` - Gives an ID to an element.  IDs should be **unique**!  Only one HTML element should have a given ID.

Classes and IDs are the most common targets for scraping!

```html
<div>Defines a division or section of the document.</div>
<a href="http://www.w3schools.com">A Hyperlink to W3Schools.com!</a>

<h1>This is a header!</h1>

<p class="myclass">This is a paragraph!</p>

<h2 class="myclass">This is a subheading!</h2>

<table id="myid">
  This is a table!
  <tr>
    <td>An entry in the first row.</td>
    <td>Another entry in the first row.</td>
  </tr>
  <tr>
    <td>An entry in the second row.</td>
    <td>Another entry in the second row.</td>
  </tr>
</table>

<ul>
  This is a list!
  <li>This is the first thing in the list!</li>
  <li>This is the second thing in the list!</li>
</ul>
```
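To make "targeting a class or ID" concrete, here is a minimal sketch that uses only the standard library's `html.parser` to list every tag carrying a `class` or `id` attribute.  The `AttributeCollector` class is a name invented for this example, and the snippet is a shortened version of the sample above.

```python
from html.parser import HTMLParser

class AttributeCollector(HTMLParser):
    """Record (tag, class, id) for every start tag that has a class or an id."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "class" in attrs or "id" in attrs:
            self.found.append((tag, attrs.get("class"), attrs.get("id")))

snippet = """
<p class="myclass">This is a paragraph!</p>
<h2 class="myclass">This is a subheading!</h2>
<table id="myid"><tr><td>An entry in the first row.</td></tr></table>
"""

collector = AttributeCollector()
collector.feed(snippet)
print(collector.found)
```

Writing a parser subclass by hand gets tedious quickly, which is part of the motivation for a higher-level library.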

In [22]:
HTML("""
<div>Defines a division or section of the document.</div>
<a href="http://www.w3schools.com">A Hyperlink to W3Schools.com!</a>

<h1>This is a header!</h1>

<p class="myclass">This is a paragraph!</p>

<h2 class="myclass">This is a subheading!</h2>

<table id="myid">
  This is a table!
  <tr>
    <td>An entry in the first row.</td>
    <td>Another entry in the first row.</td>
  </tr>
  <tr>
    <td>An entry in the second row.</td>
    <td>Another entry in the second row.</td>
  </tr>
</table>

<ul>
  This is a list!
  <li>This is the first thing in the list!</li>
  <li>This is the second thing in the list!</li>
</ul>""");

### Short Exercise

Go to https://www.mlssoccer.com/standings and suppose you want to know the distribution of goals scored (just the distribution; don't worry about which teams scored which amount).  What structure in the site could you take advantage of to get this information?

## Parsing HTML

So we've got a basic understanding of how web pages are structured, and we know how to retrieve the HTML from simple web sites, but what do we do with that very long string once we get it?  String splitting?  Regular expressions?  NO!

There is an oddly-named Python library called BeautifulSoup that will make parsing HTML much, much easier:

In [23]:
from bs4 import BeautifulSoup

response = requests.get("https://www.mlssoccer.com/standings")
soup = BeautifulSoup(response.text, "html.parser")
soup;

In [24]:
# Return the first thing that matches search criteria
soup.find("table");

In [25]:
# Return a list of things that match search criteria
soup.find_all("td", {"data-title":"Points"});

In [26]:
import numpy as np
np.mean([int(item.text) for item in soup.find_all("td", {"data-title":"Points"})])

46.909090909090907

There are many ways to navigate the tree-like structure of an HTML document with BeautifulSoup, but `find` and `find_all` are the most common.  Consult the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for more!
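Because a live page can change or go offline, it can help to practice `find` and `find_all` on a small static string first.  The table below is invented for this sketch (it is not real standings data); only the `data-title` attribute mirrors the cells above.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the shape of a standings table.
html = """
<table id="standings">
  <tr><td data-title="Club">A</td><td data-title="Points">50</td></tr>
  <tr><td data-title="Club">B</td><td data-title="Points">44</td></tr>
  <tr><td data-title="Club">C</td><td data-title="Points">38</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first match; here we look the table up by its id.
table = soup.find("table", id="standings")

# find_all returns every match; the dict filters on attribute values.
cells = table.find_all("td", {"data-title": "Points"})
points = [int(cell.text) for cell in cells]

print(points)
print(sum(points) / len(points))
```

Calling `find_all` on `table` rather than on `soup` limits the search to that one element's descendants.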

## Beyond HTML

While using the web, you have probably noticed that the content of sites generally has structure.  For example, http://www.reddit.com has many posts that share the same structure but differ in content.  A very common pattern in site design is to create an HTML template, and then repeat the HTML for each element in your data while plugging things into the appropriate spot.

What this means for you - the intrepid web-scraper - is that your life would be a lot easier if you could just access the data being plugged into the templates directly instead of pulling it back out of the page.  These data sources are called Application Programming Interfaces (APIs), and they may be advertised, or you may have to go looking for them.

Reddit happily exposes its underlying data: just add /.json to the end of any Reddit link!  http://www.reddit.com/.json

### But What is JSON?

JavaScript Object Notation (JSON) is by far the most common data exchange format on the web.  As implied by the name, the syntax is based on the one used to define objects in JavaScript (strict JSON additionally requires keys and strings to be in double quotes).  You may notice that it looks almost exactly the same as dictionary syntax in Python:

```javascript
{
    name: 'TwilightSparkle',
    friends: ['Applejack', 'Fluttershy'],
    age: 16,
    gender: 'f',
    wings: true,
    horn: true,
    title: null,
    residence: {
        town: 'Ponyville',
        address: '15 Gandolfini Lane'}
}
```

In [27]:
# Pony dictionary
twilight = {
    "name": "TwilightSparkle",
    "friends": ["Applejack", "Fluttershy"],
    "age": 16,
    "gender": "f",
    "wings": True,
    "horn": True,
    "title": None,
    "residence": {
        "town": "Ponyville",
        "address": "15 Gandolfini Lane"
    }
}

You can convert back and forth between JSON and Python objects using the `json` module.

In [28]:
import json
twilight_json = json.dumps(twilight) #Dump Python object to JSON string
twilight_json

'{"name": "TwilightSparkle", "friends": ["Applejack", "Fluttershy"], "age": 16, "gender": "f", "wings": true, "horn": true, "title": null, "residence": {"town": "Ponyville", "address": "15 Gandolfini Lane"}}'

In [29]:
json.loads(twilight_json) == twilight #Load JSON string into Python object

True
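One detail worth seeing for yourself: the `json` module only accepts strict JSON.  In this small sketch (the strings are made up for illustration), a double-quoted document parses, while a JavaScript-style literal with bare keys and single quotes is rejected.

```python
import json

# Strict JSON: double-quoted keys and strings, lowercase true / null.
valid = '{"name": "TwilightSparkle", "wings": true, "title": null}'
pony = json.loads(valid)
print(pony)

# A JavaScript object literal is not necessarily valid JSON.
invalid = "{name: 'TwilightSparkle'}"
try:
    json.loads(invalid)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
print(parsed_ok)
```

Note how `true` and `null` come back as Python's `True` and `None`.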

To keep the user experience smooth, many sites load data dynamically: the browser makes additional requests as the user approaches the end of the current content.  Let's see how this works on http://www.instagram.com

### Short Exercise

Visit https://www.udemy.com/courses/it-and-software/all-courses/.  Now try using requests to send a GET request to the same link.  Do you get the same results?  Why or why not?  Can you find anything better to scrape than the user-facing website?

### Extra Considerations

By default, when you send a request with requests, you identify yourself as a script rather than a normal browser user:

In [30]:
response = requests.get("http://google.com")
response.request.headers

{'User-Agent': 'python-requests/2.18.4', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

If you're scraping a site that doesn't want to be scraped, this User-Agent header is a dead giveaway.
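Headers can be overridden by passing a `headers` dictionary to `requests.get`.  The sketch below builds a request without sending it so that the resulting header can be inspected offline; the User-Agent string and URL are placeholders chosen for this example.

```python
import requests

# Placeholder value: pick something that honestly describes your scraper.
headers = {"User-Agent": "my-course-scraper/0.1"}

# Preparing a request shows what would be sent, without any network traffic.
prepared = requests.Request("GET", "http://example.com", headers=headers).prepare()
print(prepared.headers["User-Agent"])

# For comparison, the default that requests would otherwise announce.
print(requests.utils.default_user_agent())
```

To actually send it, `requests.get("http://example.com", headers=headers)` takes the same dictionary.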

Another dead giveaway that you're scraping is the rate at which you're making requests.  A real user could not possibly look at 1,000 Instagram posts in a minute.  Generally, you should pause between requests if you're hitting the same site repeatedly.  Even sites that don't protect against web scraping will still ban your IP if they think you're trying to launch a Denial of Service (DoS) attack on them.

In [31]:
import time
pages = [1, 2]
for page in pages:
    response = requests.get("http://steamcommunity.com/app/276340/discussions/?fp={}".format(page))
    print("Page {}, {}".format(page, response.status_code))
    soup = BeautifulSoup(response.text, "html.parser")
    # Process soup object
    # Write to db
    time.sleep(2)

Page 1, 200
Page 2, 200


Something you may notice from the request in the previous cell is the question mark followed by some additional information.  These are called query parameters, and they are basically additional information that the server uses to return exactly what the user is requesting.  In the case of the Steam forums, the `fp` parameter determines which page of the forum posts will be returned.

A query string starts with a `?`, then consists of `&` delimited `key=value` pairs.  For example, `http://www.bobsknobs.com/catalogue.php?finish=nickel&sort=price_desc&pg=2` could tell `catalogue.php` to return page two of the products with a nickel finish sorted by price descending.

It's important to stress here that query parameters have no innate meaning; their value lies in how the web server interprets and uses them.
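Rather than formatting query strings by hand, you can build and take them apart with `urllib.parse` from the standard library.  This sketch reuses the example URL from above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = "http://www.bobsknobs.com/catalogue.php"
params = {"finish": "nickel", "sort": "price_desc", "pg": 2}

# urlencode joins key=value pairs with & and escapes special characters.
url = base + "?" + urlencode(params)
print(url)

# Going the other way: split the URL and parse its query string.
# Every value comes back as a list of strings, since keys may repeat.
parsed = parse_qs(urlsplit(url).query)
print(parsed)
```

With requests you can skip the string building entirely by writing `requests.get(base, params=params)`, which encodes the dictionary for you.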

Requests requires a more thorough understanding of the web than some other tools, and sometimes students have trouble getting it to do exactly what they want.  It's also possible to emulate the way that a user actually browses the web, and a library called Selenium exists to do exactly this:

In [32]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

In [33]:
browser = webdriver.Firefox()

In [34]:
browser.get("http://www.amazon.com/")

In [35]:
# find_element(By.CSS_SELECTOR, ...) replaces the older
# find_element_by_css_selector helpers, which were removed in Selenium 4
search_box = browser.find_element(By.CSS_SELECTOR, "input#twotabsearchtextbox")

In [36]:
search_box.click()

In [37]:
search_box.send_keys("alexa")

In [38]:
search_button = browser.find_element(By.CSS_SELECTOR, "div.nav-search-submit input")

In [39]:
search_button.click()
time.sleep(3)

In [40]:
products = browser.find_elements(By.CSS_SELECTOR, "div.a-fixed-left-grid-col.a-col-right")

In [41]:
titles_prices = []
for p in products:
    try:
        title_element = p.find_element(By.CSS_SELECTOR, "h2.s-access-title")
        title = title_element.text

        price_whole_element = p.find_element(By.CSS_SELECTOR, "span.sx-price-whole")
        price = price_whole_element.text

        price_fractional_element = p.find_element(
            By.CSS_SELECTOR, "sup.sx-price-fractional")
        price += ("." + price_fractional_element.text)
    except NoSuchElementException:
        # Skip products that are missing a title or a price
        continue
    titles_prices.append((title, price))

In [42]:
titles_prices

[('All-new Echo (2nd Generation) with improved sound, powered by Dolby, and a new design – Charcoal Fabric',
  '79.99'),
 ('Echo Dot (2nd Generation) - Black', '29.99'),
 ('Echo Dot (2nd Generation) - White', '29.99')]

In [43]:
browser.close()

# Afternoon

## Installing Mongo and PyMongo

### Mongo
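1. Install MongoDB: `brew tap mongodb/brew`, then `brew install mongodb-community` (on current Homebrew the formula lives in MongoDB's own tap)
2. Start MongoDB: `brew services start mongodb-community`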

#### Do *not* run services as `root`.  Ever.  Even if someone tells you to.

### PyMongo
1. Install PyMongo: `conda install pymongo`

### SQL vs NoSQL

NoSQL does not stand for 'No SQL'.  SQL is useful for many things, and it's not going away.

> NoSQL ==> "Not Only SQL"

It's a different paradigm for dealing with messy data that does not lend itself to an RDBMS.  It's also very useful as a quick and painless solution for data storage, whereas a full relational database model takes much thought and investment.

### Mongo Clients

The command line program we use to interact with Mongo is a *client*.  Its only job is to send messages to another program, a *server*, which holds all our data and knows how to operate on it.

The command line Mongo client is written in JavaScript, so interacting with Mongo through this client looks like writing JavaScript code.

<img src="images/client-server.png" width = 500>

There are other clients.  Later on we will use `pymongo` to interact with our databases from Python.

## Working with Mongo DB

### MongoDB Concepts

#### What's it about? 

* MongoDB is a document-oriented database, an alternative to RDBMS, used for storing semi-structured data.
* JSON-like objects form the data model, rather than RDBMS tables.
* No enforced schema and, in the basic usage covered here, no joins and no multi-document transactions.
* Sub-optimal for complicated queries.

#### Structure of the database.

* MongoDB is made up of databases which contain collections (analogous to tables).
* A collection is made up of documents (analogous to rows or records).
* Each document is a JSON object made up of key-value pairs (analogous to columns).


So an RDBMS defines its columns at the table level, while a document-oriented database defines its fields at the document level.
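The contrast can be sketched without a MongoDB server at all.  Below, SQLite (from the standard library) stands in for the RDBMS and a plain list of dictionaries stands in for a collection; both stand-ins, and the sample data, are assumptions made for this illustration.

```python
import sqlite3

# Relational side: the columns are fixed when the table is created.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ponies (name TEXT, age INTEGER)")
conn.execute("INSERT INTO ponies (name, age) VALUES (?, ?)", ("Fluttershy", 15))

try:
    # The table has no wings column, so this insert is rejected.
    conn.execute(
        "INSERT INTO ponies (name, age, wings) VALUES (?, ?, ?)",
        ("Rarity", 18, 1),
    )
    extra_column_accepted = True
except sqlite3.OperationalError:
    extra_column_accepted = False
conn.close()

# Document side: each document carries its own set of fields.
documents = [
    {"name": "Fluttershy", "age": 15},
    {"name": "Rarity", "age": 18, "wings": True,
     "residence": {"town": "Ponyville"}},
]
field_sets = [sorted(doc) for doc in documents]

print(extra_column_accepted)
print(field_sets)
```

The flexibility cuts both ways: nothing stops two documents from spelling the same field differently, so consistency becomes the application's job.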

#### Why Use a Document Database?

![MongoDB vs. RDBMS](https://static1.squarespace.com/static/54022dc5e4b079634bab926c/t/588e0d33cd0f68ca7cec5871/1485704504058/)

Matt created a `ponies.json` file that can be imported into MongoDB.

```
mongoimport --db mlp --collection ponies < ponies.json
```

**Note**: If you are using Linux, you may need to add the switch `--jsonArray`.

Now start `mongo`. 

A MongoDB server holds a collection of databases, so let's check that the `mlp` database exists.

```
> show dbs
```

To use the `mlp` database, we simply do the following:

```
> use mlp
```

A database is made of collections, which are containers for the actual stored data.  A `collection` is analogous to a `table` in a classical relational database, but can contain much more flexible data than a table.

```
> db.getCollectionNames()
```

### Inserting Data

`insert` takes one argument: the document to insert.

```javascript
db.ponies.insert({
    name: 'Fluttershy',
    age: 15,
    friends: ['TwilightSparkle', 'Applejack'],
    wings: false,
    horn: false
})                 
```

A method also exists to insert many documents, which we will cover later.

## Querying Data

`find` takes two arguments: the first provides filtering criteria for documents (like a `WHERE` clause), and the second determines which fields will be returned (like `SELECT`).

The default is to return all documents and all fields:

```javascript
db.ponies.find()
```

`findOne` returns the first document to match the query criteria

```javascript
db.ponies.findOne()
```

What's the deal with the `_id` field?

##### An Aside on _id:

The `_id` field is basically the same as a primary key in relational databases.  It is a unique identifier for every document in a collection.  By default, MongoDB will auto-generate a unique id for you when you insert data, but you can manually specify the `_id` field if you have a unique identifier.  An index is automatically generated on the `_id` field, which makes queries involving it very efficient.

##### Back to Queries!

We can pass arguments for more control:

```javascript
// Find documents that have the value "TwilightSparkle" in the name field
db.ponies.find({name: 'TwilightSparkle'})

// Find documents where the friends field exists
db.ponies.find({friends: {$exists : true}})

// Find documents where "TwilightSparkle" is found in the friends field
db.ponies.find({friends: 'TwilightSparkle'})

// Return the documents where the field town inside the field residence is "Ponyville"
// A dotted field name must be quoted, or the shell reports a syntax error
// Note that _id is always returned and must explicitly be set to 0
db.ponies.find({'residence.town': 'Ponyville'}, {name: 1, _id: 0})
```

**Exercise**: Try to find all the ponies with wings.  Return just their names and ages.

## Updating Data

`update` takes two arguments: the first determines which documents should be updated (`WHERE`), and the second determines how they should be updated.

The default behavior of `update` is to replace the entire document with what is in the second argument:

```javascript
// Replaces the Applejack data completely!
// Poor Applejack will no longer have even a name field after this!
db.ponies.update({
    name: 'Applejack'}, {
    friends: ['Fluttershy', 'Rarity', 'TwilightSparkle']})
```

To change only part of a document, use the `$set` and `$push` operators:  `$set` modifies just the fields that are explicitly listed, and `$push` adds a new element to an array.

```javascript
// Replaces friends array
db.ponies.update({
    name: 'TwilightSparkle'}, {
    $set: {
        friends: ['Fluttershy', 'Rarity', 'Applejack']}})

// Adds to friends array
db.ponies.update({
    name: 'PrincessCelestia'}, {
    $push: {
        friends: 'Rarity'}})
```
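The three behaviors described above (whole-document replacement, `$set`, and `$push`) can be modeled on a plain Python dictionary.  This is a toy illustration of the semantics only, not how MongoDB is implemented; `apply_update` and the sample document are invented for this sketch.

```python
import copy

def apply_update(document, update):
    """Toy model: a plain update replaces the document, operators modify it."""
    has_operators = any(key.startswith("$") for key in update)
    if not has_operators:
        # Replacement: only _id survives from the old document.
        replaced = dict(update)
        if "_id" in document:
            replaced["_id"] = document["_id"]
        return replaced

    result = copy.deepcopy(document)
    for field, value in update.get("$set", {}).items():
        result[field] = value  # overwrite just this field
    for field, value in update.get("$push", {}).items():
        result.setdefault(field, []).append(value)  # append to the array
    return result

applejack = {"_id": 1, "name": "Applejack", "age": 71,
             "friends": ["TwilightSparkle"]}

replaced = apply_update(applejack, {"friends": ["Fluttershy", "Rarity"]})
with_set = apply_update(applejack, {"$set": {"friends": ["Fluttershy", "Rarity"]}})
with_push = apply_update(applejack, {"$push": {"friends": "Rarity"}})

print(replaced)
print(with_set)
print(with_push)
```

Comparing the first two results makes the danger of a plain replacement obvious: the name and age fields are gone.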

An `upsert` either creates a document (when no document matches the selector) or updates the existing document.

```javascript
// Upsert: This one is created
db.ponies.update({
    name: "Rarity"}, {
    $push: {
        friends: {
            $each: ["TwilightSparkle", "Applejack", "Fluttershy"]}}}, {
    upsert: true})

// Upsert: This one is updated
db.ponies.update({
    name: "Fluttershy"}, {
    $push: {
        friends: {
            $each: ["Rarity", "PrincessCelestia"]}}}, {
    upsert: true})
```

**Exercise**: Enter a pony named RainbowDash into the database that is friends with TwilightSparkle, Rarity, and Applejack.

## Deleting Data

`remove` takes one argument, a selector for which documents to delete.


```javascript
// All documents match the empty document, so this will delete every document
db.ponies.remove({})
```

```javascript
// Delete all documents where the age value is less than 25
db.ponies.remove({age : { $lt : 25} })
```

## PyMongo


`pymongo` allows Python to connect to and manipulate MongoDB.

The commands are nearly the same as the shell commands above, except that we pass Python dictionaries instead of JavaScript objects, and the method names are a bit more Pythonic.  More specifically, your keys must be strings, and multi-word method names are written in snake_case (`find_one`) rather than camelCase (`findOne`):

```python
db.ponies.find_one({"age" : { "$lt" : 25 } })
```

The [documentation](https://api.mongodb.com/python/current/tutorial.html) may be helpful to you.

In [1]:
from pymongo import MongoClient
import pprint

In [2]:
# Connect to the MongoDB instance running on this machine
# Localhost is the ip address of your own computer and maps to 127.0.0.1
# Port 27017 is the default for MongoDB
client = MongoClient('mongodb://localhost:27017/')

In [3]:
# Get the database named mlp
db = client.mlp

In [4]:
# Get (or create) the collection called ponies
ponies = db.ponies

In [5]:
# insert_one returns an InsertOneResult, which holds the _id of the new document (inserted_id).
ponies.insert_one({
    'name': 'RainbowDash', 
    'age': 16, 
    'friends': ['TwilightSparkle', 'Applejack', 'Rarity']})

<pymongo.results.InsertOneResult at 0x7f37afa8eb40>

In [6]:
# count_documents replaces Cursor.count(), which newer PyMongo versions no longer provide
ponies.count_documents({})

7

In [7]:
pprint.pprint(ponies.find_one())

{'_id': ObjectId('5a2e0a7e9c418975e80caf1a'),
 'age': 20,
 'friends': ['TwilightSparkle'],
 'horn': True,
 'name': 'PrincessCelestia',
 'wings': True}


In [8]:
rarity = ponies.find_one({'name': 'Rarity'})
pprint.pprint(rarity)

{'_id': ObjectId('5a2e0a7e9c418975e80caf1d'),
 'age': 18,
 'friends': ['Applejack', 'Bill'],
 'gender': 'f',
 'horn': True,
 'name': 'Rarity',
 'residence': {'address': '25 Gandolfini Lane', 'town': 'Ponyville'},
 'wings': True}


The same selector strategies can be used for more complex queries in `pymongo`

In [9]:
friend_of_twilight = ponies.find_one({'friends': 'TwilightSparkle'})
pprint.pprint(friend_of_twilight)

{'_id': ObjectId('5a2e0a7e9c418975e80caf1a'),
 'age': 20,
 'friends': ['TwilightSparkle'],
 'horn': True,
 'name': 'PrincessCelestia',
 'wings': True}


To get multiple results back, use `find`, which returns an iterator.

In [10]:
friends_of_twilight = ponies.find({'friends': 'TwilightSparkle'})
for friend in friends_of_twilight:
    print(friend["name"])

PrincessCelestia
Applejack
Applejack
RainbowDash


In [19]:
extreme_ponies = ponies.find({'$or': [{"age" : {'$lt': 18}}, {"age" : {"$gt": 40}}]}, {"_id": 0, "name": 1, "age": 1})
for pony in extreme_ponies:
    pprint.pprint(pony)

{'age': 71, 'name': 'Applejack'}
{'age': 16, 'name': 'TwilightSparkle'}
{'age': 15.0, 'name': 'Applejack'}
{'age': 16, 'name': 'RainbowDash'}


In [21]:
special_ages = ponies.find({'age': {"$in" : [15, 18]}}, {"_id": 0, "name": 1, "age": 1})
for pony in special_ages:
    pprint.pprint(pony)

{'age': 18, 'name': 'Rarity'}
{'age': 15.0, 'name': 'Applejack'}
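If the operator syntax feels opaque, it may help to see a selector evaluated by hand.  The function below is a toy, pure-Python model covering only the selectors used in this section (equality, array membership, `$lt`, `$gt`, `$in`, `$exists`, `$or`); it is not how MongoDB works internally, and the sample data is made up.

```python
def matches(document, query):
    """Toy check of whether a document satisfies a query selector."""
    for key, condition in query.items():
        if key == "$or":
            # $or takes a list of sub-queries; any one of them may match.
            if not any(matches(document, sub) for sub in condition):
                return False
            continue
        value = document.get(key)
        if isinstance(condition, dict):
            for op, operand in condition.items():
                if op == "$lt":
                    ok = value is not None and value < operand
                elif op == "$gt":
                    ok = value is not None and value > operand
                elif op == "$in":
                    ok = value in operand
                elif op == "$exists":
                    ok = (key in document) == operand
                else:
                    raise ValueError("operator not modeled: " + op)
                if not ok:
                    return False
        elif isinstance(value, list) and not isinstance(condition, list):
            # A scalar condition matches if it appears in the array.
            if condition not in value:
                return False
        elif value != condition:
            return False
    return True

sample = [
    {"name": "PrincessCelestia", "age": 20, "friends": ["TwilightSparkle"]},
    {"name": "TwilightSparkle", "age": 16, "friends": ["Applejack"]},
    {"name": "Applejack", "age": 71, "friends": ["TwilightSparkle"]},
    {"name": "Rarity", "age": 18},
]

def names(query):
    return [doc["name"] for doc in sample if matches(doc, query)]

extreme = names({"$or": [{"age": {"$lt": 18}}, {"age": {"$gt": 40}}]})
friends = names({"friends": "TwilightSparkle"})
special = names({"age": {"$in": [15, 18]}})
no_friends = names({"friends": {"$exists": False}})
print(extreme, friends, special, no_friends)
```

Multiple keys in one selector are combined with an implicit AND, which is why the loop returns `False` as soon as any key fails.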


**Exercise:** Find all the ponies that have a horn and wings.

In [None]:
#Code goes here

A pretty important SQL functionality that we haven't mentioned so far is `GROUP BY`.  MongoDB provides the aggregation pipeline for this purpose.  [Check it out](https://docs.mongodb.com/v3.4/core/aggregation-pipeline/) if you're interested!  No worries, you won't need it for the pair assignment today.