# PyMongoLab (Student)

We will use the `author` and `catalog` datasets in the `data` directory, and practice some queries on them using `pymongo`. To get ourselves started:

1. Make sure `mongod` is running (it should still be running from the previous exercise)
2. Make sure the books dataset has been entered (from 01_mongo_shell.md)
3. Make sure the catalog dataset has been entered (from 01_mongo_shell.md)


In [1]:
from pymongo import MongoClient
from pprint import pprint

In [2]:
# This creates a client that uses the default port on localhost.
# If connecting to AWS, you need a connection string.
# Can do the same thing with MongoClient("mongodb://localhost:27017")
client = MongoClient()

In [3]:
# Makes it look similar to shell mongo
db = client.books

In [4]:
# This should have the 'author' collection we used before and the 'catalog' collection.
# If not, you haven't connected to the right database, or haven't uploaded the data!
db.list_collection_names()  # collection_names() is deprecated in newer pymongo versions

['author']

## Example 1: find

The `find` command is similar to the shell version, but it returns a _cursor_. A cursor lets you step through a large result set without loading all of it into memory at once. The basic format is
```python
cursor = db.collection.find(where_dictionary, what_fields_dictionary)
```

Here:
* `where_dictionary` specifies the conditions a document must satisfy to be returned, similar to the WHERE clause in SQL.
* `what_fields_dictionary` tells us which fields we want returned. With the exception of `_id`, we either start with nothing and include fields (use `fieldname: True`), or start with everything and exclude fields (use `fieldname: False`). Note that `1` and `0` are often used instead of `True` and `False`.

To get results out of a cursor:

1. Convert the cursor to a list: `list(cursor)`
  * Make sure the data is "small" before doing this. You are asking for _all_ the results at once, which could be large.
  * Defeats the point of a cursor
2. Iterate through the cursor:
```python
for result in cursor:
    # .... do something with the results one at a time
```
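One practical consequence of this laziness is that a cursor can only be walked once. The sketch below uses a plain Python generator as a stand-in for a cursor, so it runs without a database; `fake_cursor` is a hypothetical helper, not part of `pymongo`. A real `pymongo` cursor behaves in a similar way and can be reset with `cursor.rewind()`.

```python
def fake_cursor(documents):
    """Hypothetical stand-in for a pymongo cursor: yields documents one at a time."""
    for doc in documents:
        yield doc


authors = [{'name': 'Harland Sanders'}, {'name': 'Joanne Rowling'}]

cursor = fake_cursor(authors)
first_pass = [doc['name'] for doc in cursor]   # consumes the cursor
second_pass = [doc['name'] for doc in cursor]  # nothing is left to read

print(first_pass)
print(second_pass)
```

If you need the results more than once, either store them in a list (when they are small) or run the query again.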

In [5]:
# Find the names of each of the authors
cursor = db.author.find({}, {'_id':0, 'name': 1})
list(cursor)

[{'name': '曹雪芹'},
 {'name': 'Harland Sanders'},
 {'name': 'Joanne Rowling'},
 {'name': 'Count Lev Nikolayevich Tolstoy'}]

We can also select from subfields, using the same sort of `field.subfield` syntax that we used in the shell

In [9]:
# List all the book titles with the author
cursor = db.author.find({}, {'_id':0, 'name': 1, 'books.title':1})

list(cursor)

[{'name': '曹雪芹', 'books': [{'title': '红楼梦'}]},
 {'name': 'Harland Sanders', 'books': [{'title': 'Tender Wings of Desire'}]},
 {'name': 'Joanne Rowling',
  'books': [{'title': "Harry Potter and the Philosopher's stone"},
   {'title': 'Harry Potter and the Chamber of secrets'},
   {'title': 'Harry Potter and the Prizoner of Azkeban'},
   {'title': 'Harry Potter and the Goblet of Fire'},
   {'title': 'Harry Potter and the Order of the Phoenix'},
   {'title': 'Harry Potter and the Half-Blood Prince'},
   {'title': 'Harry Potter and the Deathly Hallows'}]},
 {'name': 'Count Lev Nikolayevich Tolstoy',
  'books': [{'title': 'War and Peace'}, {'title': 'Anna Karenina'}]}]

## You do: Exercise 1 (3 minutes)

List each of the author names, with the book titles, the total number of sales for that book, and summary.
e.g. a line of output might be
```python
[....., 
 {'books': [{'title': 'Tender Wings of Desire', 'sold': 10, 'description': .......}], 'name': 'Harland Sanders'},
 ....]
```


HINT: You might want to run 
```python
list(db.author.find().limit(1))
```
first, so that you can see the typical names of the fields.

In [13]:
cursor = db.author.find({},  # the first dictionary sets a condition (like WHERE in SQL); {} matches everything
                        {'_id':0, 'name':1, 'books.title':1, 'books.sold':1, 'books.description':1}
                        # the second dictionary lists the fields you want to see (like SELECT column names in SQL)
                       )

list(cursor)

[{'name': '曹雪芹',
  'books': [{'title': '红楼梦',
    'description': "An aristocratic playboy is torn between the cousin with whom he has a connection, and the cousin he is supposed to marry, as his family's fortunes decline.",
    'sold': 100000000}]},
 {'name': 'Harland Sanders',
  'books': [{'title': 'Tender Wings of Desire',
    'description': 'A KFC marketing stunt for Mother\'s day (one of the busiest days of the year), which is a romance novel in which a Southern Belle falls for the mysterious stranger, Harland "Colonel" Sanders.',
    'sold': 1000}]},
 {'name': 'Joanne Rowling',
  'books': [{'title': "Harry Potter and the Philosopher's stone",
    'description': 'We are introduced to Harry Potter, Voldemort, and British magical society. Sorting hat tells us that it is our choices that define us -- Harry could have either been in Gryffindor or Slytherian.',
    'sold': 120000000},
   {'title': 'Harry Potter and the Chamber of secrets',
    'description': 'We encounter the first horc

### You do: Exercise 2 (5 mins)

The results above are a little ugly. Write a loop that outputs the following statement for each author:
```
<Author name> has sold a total of <sum of all book sales> books
```

You can either use the query above, or (if adventurous) try using an `aggregation`.

In [41]:
cursor = db.author.find({},
                       {'_id':0,
                       'name':1,
                       'books.sold':1}
                       )

list(cursor)

[{'name': '曹雪芹', 'books': [{'sold': 100000000}]},
 {'name': 'Harland Sanders', 'books': [{'sold': 1000}]},
 {'name': 'Joanne Rowling',
  'books': [{'sold': 120000000},
   {'sold': 77000000},
   {'sold': 65000000},
   {'sold': 65000000},
   {'sold': 65000000},
   {'sold': 65000000},
   {'sold': 65000000}]},
 {'name': 'Count Lev Nikolayevich Tolstoy',
  'books': [{'sold': 36000000}, {'sold': 20000000}]}]

In [39]:
cursor = db.author.find({},
                        {'_id':0,
                        'name':1,
                        'books.sold':1}
                       )

cursor[0]

{'name': '曹雪芹', 'books': [{'sold': 100000000}]}

In [40]:
for writer in cursor:
    total_sold = sum([book.get('sold', 0) for book in writer['books']])
    print(f"{writer['name']} has sold a total of {total_sold} books")

曹雪芹 has sold a total of 100000000 books
Harland Sanders has sold a total of 1000 books
Joanne Rowling has sold a total of 522000000 books
Count Lev Nikolayevich Tolstoy has sold a total of 56000000 books
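For the adventurous route, here is one possible aggregation for the same totals. This is a sketch rather than the official solution: the `pipeline` list is what one might pass to `db.author.aggregate(pipeline)`, and the stages it uses are described in the Aggregation section further down. The block does not connect to MongoDB. Instead it emulates the two stages in plain Python on a small sample copied from the output above (`sample_authors` is an assumption made for illustration).

```python
from collections import defaultdict

# Pipeline one might pass to db.author.aggregate(pipeline); it is not executed here.
pipeline = [
    {'$unwind': '$books'},                                                # one document per book
    {'$group': {'_id': '$name', 'total_sold': {'$sum': '$books.sold'}}},  # sum sales per author
]

# Plain-Python emulation on a sample taken from the output above.
sample_authors = [
    {'name': 'Harland Sanders', 'books': [{'sold': 1000}]},
    {'name': 'Count Lev Nikolayevich Tolstoy',
     'books': [{'sold': 36000000}, {'sold': 20000000}]},
]

totals = defaultdict(int)
for author in sample_authors:
    for book in author['books']:                       # plays the role of $unwind
        totals[author['name']] += book.get('sold', 0)  # plays the role of $group with $sum

for name, total_sold in totals.items():
    print(f"{name} has sold a total of {total_sold} books")
```

The aggregation does the summing on the server, so only one small document per author travels back to Python.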


### Insert

If collecting results from webscraping or an API, you might want to insert new books. Let's insert Chuck Palahniuk into the database.

Note that we have __not__ included his nationality or wikipedia page. This is emphasizing that unlike SQL tables, it is up to us which fields we put in the database. We _should_ include this information (it is relevant here) but the NoSQL design puts it on us to remember to do it -- we are not going to get a warning that fields are "missing"! 

Note that the entries in `books` don't have a number sold either!

We also would not get an error if we renamed `books` to `book`, or `published_books`. It is important to be disciplined when entering data into a NoSQL database.
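Since the database will not warn us, one option is to check documents ourselves before inserting them. The helper below is a minimal sketch: `missing_fields` and `REQUIRED_AUTHOR_FIELDS` are hypothetical names, and the field list is an assumed convention based on the fields the existing author documents appear to carry.

```python
# Assumed convention: the fields we expect every author document to have.
REQUIRED_AUTHOR_FIELDS = {'name', 'nationality', 'bio', 'books'}


def missing_fields(document, required=REQUIRED_AUTHOR_FIELDS):
    """Return the required field names that are absent from the document."""
    return sorted(required - set(document))


# 'book' instead of 'books' is exactly the kind of slip MongoDB would accept silently.
candidate = {'name': 'Chuck Palahniuk', 'book': [{'title': 'Fight club'}]}

print(missing_fields(candidate))
```

A check like this could run before each `insert_one` call, printing a warning or refusing to insert when the list is not empty.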

In [42]:
new_document = {
    'name': 'Chuck Palahniuk',
    'books': [{
        'title': 'Fight club',
        'year': 1999,
        'description': 'A man and his imaginary friend make a fight club ..... with soap'
    },
    {
        'title': 'Lullaby',
        'year': 2002,
        'pages': 272,
        'publisher': 'Doubleday',
        'description': 'A lullaby that kills people more effectively than the telephone call in "The Ring"'
    }]
}



## You do: insert the document
db.author.insert_one(new_document)  # inserts a single new document

There is also `insert_many` which you could use to insert a list of new documents.

```python
db.author.insert_many( [new_doc1, new_doc2, ...., new_docN] )
```

## Exercise 3: You do (8 mins)

The code below loads a set of famous lines from the data directory. Add a new collection, `quotes`, to the books database that has documents in the following form:
```python
{'title': <title of book>, 'author': <quote author>, 'quote': <quotation>}
```

In [43]:
# Load and preview the data
with open('data/famous_lines.txt', 'r', encoding='utf-8') as f:  # explicit UTF-8 avoids garbled accents and apostrophes
    quotes = [line.split('|')  # each line has the form title|author|quote
              for line in f]   # iterates through every line in this txt file

quotes[:3]

[['Of Mice And Men',
  'John Steinbeck',
  '"Maybe ever’body in the whole damn world is scared of each other."\n'],
 ['Invisible Man',
  'Ralph Ellison',
  '"Life is to be lived, not controlled; and humanity is won by continuing to play in face of certain defeat."\n'],
 ['Wuthering Heights', 'Emily Brontë', '"Terror made me cruel"\n']]

In [45]:
len(quotes)

40

In [49]:
quotes[0]

['Of Mice And Men',
 'John Steinbeck',
 '"Maybe ever’body in the whole damn world is scared of each other."\n']

In [50]:
print("Index 0 in quotes list: ", quotes[0][0])
print("Index 1 in quotes list: ", quotes[0][1])
print("Index 2 in quotes list: ", quotes[0][2])

Index 0 in quotes list:  Of Mice And Men
Index 1 in quotes list:  John Steinbeck
Index 2 in quotes list:  "Maybe ever’body in the whole damn world is scared of each other."



In [52]:
# Exercise: load into the database
list_of_quotes = [
    {'title': title,
    'author':author,
    'quote':quote.strip()} # strips leading/trailing whitespace, including the newline
    for title, author, quote in quotes  # pulls out each index in the list
]

db.quotes.insert_many(list_of_quotes)

<pymongo.results.InsertManyResult at 0x197115c4fc8>


## Exercise 4: you do (2 mins)

Find the quotation from 1984

In [55]:
db.quotes.find_one({'title':'1984'})

{'_id': ObjectId('5dd213350ec8e42626fd3df6'),
 'title': '1984',
 'author': 'George Orwell',
 'quote': '"It was a bright cold day in April, and the clocks were striking thirteen."'}

In [57]:
db.quotes.find_one({'title':'1984'}, {'quote':1})

{'_id': ObjectId('5dd213350ec8e42626fd3df6'),
 'quote': '"It was a bright cold day in April, and the clocks were striking thirteen."'}

In [77]:
list(db.quotes.find().limit(2))

[{'_id': ObjectId('5dd213350ec8e42626fd3df1'),
  'title': 'Of Mice And Men',
  'author': 'John Steinbeck',
  'quote': '"Maybe ever’body in the whole damn world is scared of each other."'},
 {'_id': ObjectId('5dd213350ec8e42626fd3df2'),
  'title': 'Invisible Man',
  'author': 'Ralph Ellison',
  'quote': '"Life is to be lived, not controlled; and humanity is won by continuing to play in face of certain defeat."'}]

In [69]:
# `quotes` is a list of lists, so it has no .get(); iterate over the stored documents (dicts) instead
for doc in db.quotes.find():
    quote = doc.get('quote', 'no')
    print(f"{doc['title']} has {quote} quote.")

## Exercise 5: we do (2 mins)

Find the number of quotes we loaded. There are lots of ways of doing this, and some are better than others, so we'll do this as a group.

In [65]:
len(list_of_quotes)

40

In [66]:
db.quotes.count_documents({})  # counts on the server; cursor.count() is deprecated in newer pymongo versions

40

## Deleting and dropping

Let's restore the db to how we started it. We are going to
- drop the `quotes` collection
- delete the Chuck Palahniuk document


In [None]:
db.drop_collection('quotes')

In [None]:
db.author.delete_one({'name': 'Chuck Palahniuk'})

## Aggregation

Aggregations allow us to chain a set of operations together. An aggregation takes a list of operations that are executed in order. It is similar to the _pipeline_ that we used in sklearn. Each operation is a dictionary of the form
```python
{ operation_name : operation_arguments}
```

Here are the common operations:

| Operation name | Description | Arguments |
| --- | --- | --- |
| `'$match'` | Acts like a where clause, similar to the first argument in `find`| A dictionary | 
| `'$group'` | Aggregates documents together | A dictionary. Must have an `_id` field to group objects together |
| `'$unwind'` | "Unwinds" a field that holds an array. See description below | A string (name of the field to unwind, prefixed with `$`) |
| `'$project'` | Includes or excludes fields. Similar to the second argument of `find`. Can also be used to rename fields | A dictionary |
| `'$sample'` | Samples `n` items randomly | A dictionary of form `{'size': n}` | 
| `'$limit'` | Limits the collection to the first `n` items | A positive integer `n` | 
| `'$sort'` | Sorts the collection based on field names passed | `{'fieldname1': 1, 'fieldname2': 1, .... }` (`1` ascending, `-1` descending) |

There are other operations, such as bucketing (binning data for histograms) and skip (an offset for scrolling through data). A complete list can be found here: https://docs.mongodb.com/manual/reference/operator/aggregation/sort/

#### Unwind

Unwind is a little hard to describe, but not difficult to understand. If a field has an array, `$unwind` duplicates the record for each entry in the array. For example, let's say we have the document
```python
{
    'name': 'Vader',
    'campaigns': ['Mustafar', 'Death Star 1', 'Death Star 2'],
    'powers': ['force choke', 'telekinesis'],
    'allegiance': 'dark side'
}
```

If we `$unwind` on `campaigns`, we get 3 documents, one for each campaign:
```python
## result of {'$unwind': '$campaigns'} on above
[{
    'name': 'Vader',
    'campaigns': 'Mustafar',
    'powers': ['force choke', 'telekinesis'],
    'allegiance': 'dark side'
},
{
    'name': 'Vader',
    'campaigns': 'Death Star 1',
    'powers': ['force choke', 'telekinesis'],
    'allegiance': 'dark side'
},
{
    'name': 'Vader',
    'campaigns': 'Death Star 2',
    'powers': ['force choke', 'telekinesis'],
    'allegiance': 'dark side'
}]
```
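To make the duplication concrete, here is a plain-Python sketch of what `$unwind` does. The `unwind` function is a hypothetical emulation written for this lab, not a `pymongo` call, and it only handles fields that hold a list (documents without the field produce no output here).

```python
def unwind(documents, field):
    """Emulate {'$unwind': '$<field>'} for documents whose <field> is a list."""
    for doc in documents:
        for item in doc.get(field, []):
            # Copy the document, replacing the array with a single element.
            yield {**doc, field: item}


vader = {
    'name': 'Vader',
    'campaigns': ['Death Star 1', 'Death Star 2'],
    'powers': ['force choke', 'telekinesis'],
}

unwound = list(unwind([vader], 'campaigns'))
for doc in unwound:
    print(doc)
```

Every other field, including the `powers` array, is copied unchanged into each output document.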

## Exercise 6: Check for understanding (4 mins)


Here is a query that tries to find books that sold more than 70,000,000 copies. We use 
`'books.sold': {'$gt': 70000000}` to select books that sold more than 70,000,000 copies, yet the results are incorrect.

**Why doesn't this query work?**


In [79]:
list(db.author.find({'books.sold': {'$gt': 70000000}}, {'books.title': 1, 'name': 1, 'books.sold': 1}))

[{'_id': 4, 'name': '曹雪芹', 'books': [{'title': '红楼梦', 'sold': 100000000}]},
 {'_id': 1,
  'name': 'Joanne Rowling',
  'books': [{'title': "Harry Potter and the Philosopher's stone",
    'sold': 120000000},
   {'title': 'Harry Potter and the Chamber of secrets', 'sold': 77000000},
   {'title': 'Harry Potter and the Prizoner of Azkeban', 'sold': 65000000},
   {'title': 'Harry Potter and the Goblet of Fire', 'sold': 65000000},
   {'title': 'Harry Potter and the Order of the Phoenix', 'sold': 65000000},
   {'title': 'Harry Potter and the Half-Blood Prince', 'sold': 65000000},
   {'title': 'Harry Potter and the Deathly Hallows', 'sold': 65000000}]}]

In [80]:
## Answer here

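Answer: `find` matches whole documents, not individual array elements. An author is returned if at least one of their books sold more than 70,000,000 copies, and the projection then shows all of that author's books, including the ones below the threshold.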
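The difference between matching a document and filtering its array can be sketched in plain Python. The `rowling` dictionary below is a shortened copy of the output above, and the three variables are illustrative names only; no database is involved.

```python
THRESHOLD = 70000000

rowling = {
    'name': 'Joanne Rowling',
    'books': [
        {'title': "Harry Potter and the Philosopher's stone", 'sold': 120000000},
        {'title': 'Harry Potter and the Deathly Hallows', 'sold': 65000000},
    ],
}

# A condition on 'books.sold' keeps the whole document if ANY book qualifies.
document_matches = any(book['sold'] > THRESHOLD for book in rowling['books'])

# The projection then returns every book in the matched document.
returned_titles = [book['title'] for book in rowling['books']]

# What we actually wanted: only the books above the threshold.
best_sellers = [book['title'] for book in rowling['books'] if book['sold'] > THRESHOLD]

print(document_matches, len(returned_titles), len(best_sellers))
```

Filtering at the level of individual books is what the `$unwind` stage makes possible.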

## Exercise 7: We do (8 mins)

Write an aggregation that selects only the best sellers (more than 70 million copies)

In [78]:
pipeline = [
    {'$unwind':'$books'}, # the leading $ tells mongo we mean the books field, not the literal string 'books'
    {'$match': {'books.sold':{'$gt':70000000}}},  # acts like a WHERE clause
]

list(db.author.aggregate(pipeline))

[{'_id': 4,
  'name': '曹雪芹',
  'nationality': ['Chinese'],
  'bio': 'https://en.wikipedia.org/wiki/Cao_Xueqin',
  'books': {'title': '红楼梦',
   'alt_title': 'Dream of the Red Chamber',
   'description': "An aristocratic playboy is torn between the cousin with whom he has a connection, and the cousin he is supposed to marry, as his family's fortunes decline.",
   'sold': 100000000}},
 {'_id': 1,
  'name': 'Joanne Rowling',
  'aliases': ['J. K. Rowling', 'Robert Galbraith'],
  'nationality': ['British'],
  'bio': 'https://en.wikipedia.org/wiki/J._K._Rowling',
  'books': {'title': "Harry Potter and the Philosopher's stone",
   'alt_title': "Harry Potter and the Sourcer's stone",
   'description': 'We are introduced to Harry Potter, Voldemort, and British magical society. Sorting hat tells us that it is our choices that define us -- Harry could have either been in Gryffindor or Slytherian.',
   'sold': 120000000}},
 {'_id': 1,
  'name': 'Joanne Rowling',
  'aliases': ['J. K. Rowling', 'Robe

In [None]:
# Completed pipeline: one output document per best-selling book

pipeline = [
    {'$unwind':'$books'}, # the leading $ tells mongo we mean the books field, not the literal string 'books'
    {'$match': {'books.sold':{'$gt':70000000}}},  # acts like a WHERE clause
    {'$project':{'_id':0, 'title':'$books.title', 'author':'$name', 'copies_sold':'$books.sold'}}  # keep and rename only the fields we need
]

list(db.author.aggregate(pipeline))

## Common trick -- counting records per group

In the Chicago restaurant dataset, we had restaurants with different "price" points. What if we wanted to count how many restaurants were at each price point?

Mongo uses aggregation to solve this. The idea is when grouping, we assign each record the number `1`, then sum them. Let's see an example.

In [None]:
pipeline = [
    {'$group': {'_id': '$price', 'num_restaurants': {'$sum': 1}}},   # group by price, add 1 for every record and store it in num_restaurants
    {'$project': {'_id': 0, 'price': '$_id', 'num_restaurants': 1}}, # rename _id to price
    {'$sort': {'price': 1}}                                         # sort by price rating, ascending
]


# The above code is similar to the following SQL query:
# SELECT price, COUNT(*) AS num_restaurants FROM <table> GROUP BY price ORDER BY price

list(client.outings.restaurant.aggregate(pipeline))
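The "assign 1 and sum" idea can also be sketched in plain Python. The `restaurants` list below is made-up sample data (the real collection is not shown in this lab), and `collections.Counter` plays the role of the `$group` stage with `{'$sum': 1}`.

```python
from collections import Counter

# Hypothetical sample records standing in for the restaurant collection.
restaurants = [
    {'name': 'A', 'price': 1},
    {'name': 'B', 'price': 2},
    {'name': 'C', 'price': 2},
    {'name': 'D', 'price': 3},
    {'name': 'E', 'price': 1},
    {'name': 'F', 'price': 2},
]

# Group by price and add 1 for every record (the $group stage).
counts = Counter(r['price'] for r in restaurants)

# Rename the grouping key to 'price' and sort ascending (the $project and $sort stages).
result = [{'price': price, 'num_restaurants': n} for price, n in sorted(counts.items())]

print(result)
```

The pipeline version does the same work on the server, which matters once the collection is too large to pull into Python.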

There will be practice with aggregations in tomorrow's pair!