### DATA ENGINEERING PLATFORMS (MSCA 31012)
### File: Class Exercise - Session 7 - PythonMongoClient
### Desc: Connecting to MongoDB via Jupyter Notebook
### Authors: Shreenidhi Bharadwaj
### Date: 11/15/2018

References: 
https://docs.mongodb.com/getting-started/python/client/

Installation:
`pip install pymongo`

pymongo is the Python driver for MongoDB; it lets Python code connect to a MongoDB server. The setup steps are as follows:

1. Install and start MongoDB on your local machine.
2. Run mongod with the data folder option (`--dbpath`):
    "C:\Program Files\MongoDB\Server\3.4\bin\mongod.exe" --dbpath "C:\data"

3. Download the sample dataset file:
https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json

4. Import the downloaded sample data into MongoDB (25359 documents):
mongoimport --db test --collection restaurants --drop --file "C:\Users\SBharadwaj\Desktop\Shree\DEPA\03-Lectures\7\InClass Exercises\primer-dataset.json"

In [1]:
#import libraries
import pymongo
import json
from pymongo import MongoClient

### Connect to MongoDB

In [2]:
#connect to local database server (MongoClient() defaults to localhost:27017)
client = MongoClient()

#switch to test DB
db = client.test

In [3]:
# function to print only the first n documents (to avoid perf/memory issues)
def printhead(cursor, n):
    for idx, document in enumerate(cursor):
        # idx is zero-based, so idx < n prints exactly n documents
        if idx < n:
            print(document)
        else:
            break
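
The same "first n" behaviour can also be sketched with `itertools.islice`, which stops pulling from an iterable after n items. The name `head` is hypothetical and not part of the original notebook, and a plain list stands in for a cursor so the snippet runs without a database.

```python
from itertools import islice

def head(iterable, n):
    """Return at most the first n items of any iterable as a list."""
    return list(islice(iterable, n))

# A plain list stands in for a pymongo cursor in this sketch.
sample_docs = [{"name": "A"}, {"name": "B"}, {"name": "C"}]
first_two = head(sample_docs, 2)
for document in first_two:
    print(document)
```

With a real cursor, pymongo's `Cursor.limit(n)` asks the server to return at most n documents, which avoids transferring the rest.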

### Query MongoDB

In [4]:
# List the first 2 documents in the restaurants collection
restaurants = db.restaurants.find()
printhead(restaurants, 2)

{'_id': ObjectId('5bef6d18a6b0511f7dbc161b'), 'address': {'building': '351', 'coord': [-73.98513559999999, 40.7676919], 'street': 'West   57 Street', 'zipcode': '10019'}, 'borough': 'Manhattan', 'cuisine': 'Irish', 'grades': [{'date': datetime.datetime(2014, 9, 6, 0, 0), 'grade': 'A', 'score': 2}, {'date': datetime.datetime(2013, 7, 22, 0, 0), 'grade': 'A', 'score': 11}, {'date': datetime.datetime(2012, 7, 31, 0, 0), 'grade': 'A', 'score': 12}, {'date': datetime.datetime(2011, 12, 29, 0, 0), 'grade': 'A', 'score': 12}], 'name': 'Dj Reynolds Pub And Restaurant', 'restaurant_id': '30191841'}
{'_id': ObjectId('5bef6d18a6b0511f7dbc161c'), 'address': {'building': '469', 'coord': [-73.961704, 40.662942], 'street': 'Flatbush Avenue', 'zipcode': '11225'}, 'borough': 'Brooklyn', 'cuisine': 'Hamburgers', 'grades': [{'date': datetime.datetime(2014, 12, 30, 0, 0), 'grade': 'A', 'score': 8}, {'date': datetime.datetime(2014, 7, 1, 0, 0), 'grade': 'B', 'score': 23}, {'date': datetime.datetime(2013, 4

#### Copy and paste the JSON results into a JSON formatter (URL below) and click Format to get a clean view of the data
http://jsonviewer.stack.hu/ (Pretty JSON)
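
The printed documents are Python representations (for example `ObjectId(...)` and `datetime.datetime(...)`), which strict JSON tools may reject. A minimal sketch of producing formatter-friendly JSON with the standard library follows; `to_pretty_json` and the trimmed sample document are hypothetical, and `default=str` simply stringifies any value the `json` module cannot encode.

```python
import json
from datetime import datetime

def to_pretty_json(document):
    """Serialize a document to indented JSON, falling back to str() for
    values json cannot encode (e.g. datetime, or ObjectId in real results)."""
    return json.dumps(document, default=str, indent=2)

# Hypothetical sample, trimmed from the shape of the documents shown above.
sample = {
    "borough": "Manhattan",
    "cuisine": "Irish",
    "grades": [{"date": datetime(2014, 9, 6), "grade": "A", "score": 2}],
}
pretty = to_pretty_json(sample)
print(pretty)
```

pymongo also ships a `bson.json_util` module for converting BSON types to JSON more faithfully; the stringifying fallback above is only a convenience for viewing.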

In [9]:
# A Cursor has no .data attribute; iterate over it (or call list() on it) to read the documents
restaurantData.data

AttributeError: 'Cursor' object has no attribute 'data'

In [6]:
# List all documents in the restaurants collection where borough is Manhattan
restaurantData = db.restaurants.find({"borough": "Manhattan"})
# find() returns a lazy Cursor, not a JSON string, so json.loads(str(restaurantData))
# raises JSONDecodeError; iterate over the cursor instead
printhead(restaurantData, 2)

In [24]:
# Sort the query results based on borough and zipcode
restaurantData = db.restaurants.find().sort([
    ("borough", pymongo.ASCENDING),
    ("address.zipcode", pymongo.ASCENDING)
])
printhead(restaurantData, 2)

{'_id': ObjectId('5be7b4e1e0b0663ca0536697'), 'address': {'building': '650', 'coord': [-73.92537449999999, 40.8207116], 'street': 'Grand Concourse', 'zipcode': ''}, 'borough': 'Bronx', 'cuisine': 'Sandwiches', 'grades': [{'date': datetime.datetime(2014, 9, 30, 0, 0), 'grade': 'A', 'score': 7}], 'name': 'Subway#50497 (Cardinal Hayes High School)', 'restaurant_id': '50006048'}
{'_id': ObjectId('5be7b4e0e0b0663ca053139c'), 'address': {'building': '72', 'coord': [-73.92506, 40.8275556], 'street': 'East  161 Street', 'zipcode': '10451'}, 'borough': 'Bronx', 'cuisine': 'American ', 'grades': [{'date': datetime.datetime(2014, 4, 15, 0, 0), 'grade': 'A', 'score': 9}, {'date': datetime.datetime(2013, 11, 14, 0, 0), 'grade': 'A', 'score': 4}, {'date': datetime.datetime(2013, 7, 29, 0, 0), 'grade': 'A', 'score': 10}, {'date': datetime.datetime(2012, 12, 31, 0, 0), 'grade': 'B', 'score': 15}, {'date': datetime.datetime(2012, 5, 30, 0, 0), 'grade': 'A', 'score': 13}, {'date': datetime.datetime(2012
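
To illustrate what the compound sort means, here is a minimal pure-Python sketch that orders a few hypothetical documents by borough and then by zipcode. It only mimics the ordering semantics; MongoDB performs the real sort on the server. Note how the empty-string zipcode sorts before any non-empty one, as in the output above.

```python
# Hypothetical sample documents with the same shape as the restaurants data.
sample = [
    {"borough": "Brooklyn", "address": {"zipcode": "11225"}},
    {"borough": "Bronx", "address": {"zipcode": "10451"}},
    {"borough": "Bronx", "address": {"zipcode": ""}},
    {"borough": "Manhattan", "address": {"zipcode": "10019"}},
]

# Sort by borough first, then by zipcode within each borough (both ascending).
ordered = sorted(sample, key=lambda d: (d["borough"], d["address"]["zipcode"]))
for doc in ordered:
    print(doc["borough"], repr(doc["address"]["zipcode"]))
```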

#### Insert data

In [10]:
# insert data relating to a new restaurant
from datetime import datetime
result = db.restaurants.insert_one(
    {
        "address": {
            "street": "2 Avenue",
            "zipcode": "10075",
            "building": "1480",
            "coord": [-73.9557413, 40.7720266]
        },
        "borough": "Manhattan",
        "cuisine": "Italian",
        "grades": [
            {
                "date": datetime.strptime("2014-10-01", "%Y-%m-%d"),
                "grade": "A",
                "score": 11
            },
            {
                "date": datetime.strptime("2014-01-16", "%Y-%m-%d"),
                "grade": "B",
                "score": 17
            }
        ],
        "name": "Vella",
        "restaurant_id": "41704620"
    }
)

In [11]:
#print object type of the result
result

<pymongo.results.InsertOneResult at 0x10879bec8>

In [12]:
#check document that was inserted
result.inserted_id

ObjectId('5bef83cdd972f623f4cc0eaf')

#### Insert more than one document

In [13]:
# NOTE: db.test refers to a collection named "test" (inside the test database), not to restaurants
result = db.test.insert_many([{
        "address": {
            "street": "2 Avenue",
            "zipcode": "10075",
            "building": "1480",
            "coord": [-72.937413, 40.75466]
        },
        "borough": "Manhattan",
        "cuisine": "Indian",
        "grades": [
            {
                "date": datetime.strptime("2014-10-01", "%Y-%m-%d"),
                "grade": "A",
                "score": 11
            },
            {
                "date": datetime.strptime("2015-05-16", "%Y-%m-%d"),
                "grade": "B",
                "score": 17
            }
        ],
        "name": "India Garden",
        "restaurant_id": "4170462" + str(i)
    } for i in range(4)])

In [14]:
#documents that were inserted
result.inserted_ids

[ObjectId('5bef83d3d972f623f4cc0eb0'),
 ObjectId('5bef83d3d972f623f4cc0eb1'),
 ObjectId('5bef83d3d972f623f4cc0eb2'),
 ObjectId('5bef83d3d972f623f4cc0eb3')]

#### Update documents { update_one(),update_many() }
In MongoDB, a write operation is atomic at the level of a single document, even if the operation modifies multiple embedded documents within that document. When a single write operation modifies multiple documents, the modification of each document is atomic, but the operation as a whole is not, and other operations may interleave. However, in MongoDB versions before 4.0 (such as the 3.4 server used here), you can isolate a single write operation that affects multiple documents using the $isolated operator; the operator was removed in MongoDB 4.0.

In [15]:
#update document 
result = db.restaurants.update_one(
    {"cuisine": "Indian"},
    {
        "$set": {
            "name": "Mexican Garden"
        },
        "$currentDate": {"lastModified": True}
    }
)

In [16]:
#documents that were updated
print(result.matched_count)
print(result.modified_count)
cursor = db.restaurants.find({"name": "Mexican Garden"})
printhead(cursor, 10)

1
1
{'_id': ObjectId('5bef6d19a6b0511f7dbc174a'), 'address': {'building': '320', 'coord': [-73.977597, 40.779593], 'street': 'Columbus Avenue', 'zipcode': '10023'}, 'borough': 'Manhattan', 'cuisine': 'Indian', 'grades': [{'date': datetime.datetime(2014, 10, 27, 0, 0), 'grade': 'A', 'score': 7}, {'date': datetime.datetime(2013, 7, 29, 0, 0), 'grade': 'A', 'score': 5}, {'date': datetime.datetime(2013, 2, 19, 0, 0), 'grade': 'A', 'score': 11}, {'date': datetime.datetime(2012, 1, 12, 0, 0), 'grade': 'A', 'score': 2}], 'name': 'Mexican Garden', 'restaurant_id': '40370243', 'lastModified': datetime.datetime(2018, 11, 17, 2, 58, 29, 811000)}


####  Replace documents {replace_one()}
Use replace_one() to replace the entire document (except its _id) rather than selected fields.

In [17]:
#replace document
#After the replacement, the document contains only _id plus the fields in the replacement document.

result = db.restaurants.replace_one(
    {"restaurant_id": "41704620"},
    {
        "name": "Mexican Garden",
        "cuisine": "Mexican",
        "address": {
            "coord": [-73.9557413, 40.7720266],
            "building": "1480",
            "street": "2 Avenue",
            "zipcode": "10075"
        }
    }
)

In [18]:
#documents that were replaced
print(result.matched_count)
print(result.modified_count)
cursor = db.restaurants.find({"name": "Mexican Garden"})
printhead(cursor, 10)

1
1
{'_id': ObjectId('5bef6d19a6b0511f7dbc174a'), 'address': {'building': '320', 'coord': [-73.977597, 40.779593], 'street': 'Columbus Avenue', 'zipcode': '10023'}, 'borough': 'Manhattan', 'cuisine': 'Indian', 'grades': [{'date': datetime.datetime(2014, 10, 27, 0, 0), 'grade': 'A', 'score': 7}, {'date': datetime.datetime(2013, 7, 29, 0, 0), 'grade': 'A', 'score': 5}, {'date': datetime.datetime(2013, 2, 19, 0, 0), 'grade': 'A', 'score': 11}, {'date': datetime.datetime(2012, 1, 12, 0, 0), 'grade': 'A', 'score': 2}], 'name': 'Mexican Garden', 'restaurant_id': '40370243', 'lastModified': datetime.datetime(2018, 11, 17, 2, 58, 29, 811000)}
{'_id': ObjectId('5bef6d1aa6b0511f7dbc5b16'), 'name': 'Mexican Garden', 'cuisine': 'Mexican', 'address': {'coord': [-73.9557413, 40.7720266], 'building': '1480', 'street': '2 Avenue', 'zipcode': '10075'}}
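
The difference between the two results above can be sketched in plain Python. The helpers `apply_set` and `apply_replace` are hypothetical and only mimic the field-level effect of `$set` and `replace_one()` on a single flat document; they do not handle dotted paths or other update operators.

```python
def apply_set(document, fields):
    """Mimic $set: copy the document and overwrite only the given fields."""
    updated = dict(document)
    updated.update(fields)
    return updated

def apply_replace(document, replacement):
    """Mimic replace_one: keep _id and discard every other original field."""
    return {"_id": document["_id"], **replacement}

# Hypothetical flat document standing in for a restaurant record.
original = {"_id": 1, "name": "Vella", "cuisine": "Italian", "borough": "Manhattan"}

after_set = apply_set(original, {"name": "Mexican Garden"})
after_replace = apply_replace(original, {"name": "Mexican Garden", "cuisine": "Mexican"})

print(after_set)      # borough and cuisine survive
print(after_replace)  # borough is gone; only _id and the replacement fields remain
```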


####  Data Aggregation, Grouping & Sorting 
Documents enter a multi-stage pipeline that transforms the documents into aggregated results

In [37]:
# Group documents by borough and get the count for each, sorted in descending order
cursor = db.restaurants.aggregate(
    [ 
        { '$group': { '_id': "$borough", "count": { '$sum': 1 } } },
        # sort on the "count" field produced by $group; a field that does not exist
        # in the grouped output (such as "total") gives no meaningful order
        { '$sort' : {'count' : -1} }
    ]
)
printhead(cursor, 10)

{'_id': 'Manhattan', 'count': 10259}
{'_id': 'Brooklyn', 'count': 6086}
{'_id': 'Queens', 'count': 5656}
{'_id': 'Bronx', 'count': 2338}
{'_id': 'Staten Island', 'count': 969}
{'_id': 'Missing', 'count': 51}
{'_id': None, 'count': 1}
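
As a rough mental model of the `$group` and `$sort` stages, the following pure-Python sketch counts hypothetical documents per borough and lists the groups from largest to smallest. It is an illustration of the semantics, not how MongoDB executes the pipeline; a document without a `borough` field lands in the `None` group, much like the `None` row above.

```python
from collections import Counter

# Hypothetical sample documents; the last one has no "borough" field.
sample = [
    {"borough": "Manhattan"},
    {"borough": "Queens"},
    {"borough": "Manhattan"},
    {"name": "no borough field"},
]

# $group with {'$sum': 1}: count documents per borough (missing field -> None).
counts = Counter(doc.get("borough") for doc in sample)

# $sort on count, descending.
rows = [{"_id": key, "count": n} for key, n in counts.most_common()]
for row in rows:
    print(row)
```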


In [38]:
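# find a list of restaurants located in the Bronx, grouped by restaurant category
# NOTE: the documents have no "categories" field, so every document falls into a single
# group with _id None; group on "$cuisine" instead to get per-cuisine counts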
cursor = db.restaurants.aggregate( 
      [ 
          { '$match': { "borough": "Bronx" } },
#           { '$unwind': '$categories'},
          { '$group': { '_id': "$categories", 'Bronx restaurants': { '$sum': 1 } } }
      ]  )
printhead(cursor, 10)

{'_id': None, 'Bronx restaurants': 2338}


In [39]:
# The following pipeline uses $match to query the restaurants collection for documents with borough 
# equal to "Queens" and cuisine equal to Brazilian. The _id field contains the distinct zipcode value.
cursor = db.restaurants.aggregate(
   [
     { '$match': { "borough": "Queens", "cuisine": "Brazilian" } },
     { '$group': { "_id": "$address.zipcode" , "count": { '$sum': 1 } } }
   ] )
printhead(cursor, 10)

{'_id': '11368', 'count': 1}
{'_id': '11101', 'count': 2}
{'_id': '11106', 'count': 3}
{'_id': '11377', 'count': 1}
{'_id': '11103', 'count': 1}
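
The `$match` then `$group` pattern, including the dotted path `address.zipcode`, can be sketched in plain Python as a filter followed by a count. The helper `get_path` and the sample documents are hypothetical and only illustrate how a dotted path reaches into an embedded document.

```python
from collections import Counter

def get_path(document, dotted_path):
    """Resolve a dotted path such as 'address.zipcode'; return None if absent."""
    value = document
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

# Hypothetical sample documents.
sample = [
    {"borough": "Queens", "cuisine": "Brazilian", "address": {"zipcode": "11101"}},
    {"borough": "Queens", "cuisine": "Brazilian", "address": {"zipcode": "11101"}},
    {"borough": "Queens", "cuisine": "Brazilian", "address": {"zipcode": "11368"}},
    {"borough": "Queens", "cuisine": "Italian", "address": {"zipcode": "11101"}},
    {"borough": "Bronx", "cuisine": "Brazilian", "address": {"zipcode": "10451"}},
]

# $match: keep only documents where both conditions hold.
matched = [d for d in sample
           if d.get("borough") == "Queens" and d.get("cuisine") == "Brazilian"]

# $group on "$address.zipcode" with {'$sum': 1}.
by_zip = Counter(get_path(d, "address.zipcode") for d in matched)
print(dict(by_zip))
```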


### Delete documents

delete_one(), delete_many()

In [40]:
#delete one document where name matches condition
db.restaurants.delete_one({"name": "India Garden"})

<pymongo.results.DeleteResult at 0x17acf615288>

In [41]:
#delete all documents where name matches condition
db.restaurants.delete_many({"name": "India Garden"})

<pymongo.results.DeleteResult at 0x17acf5bbbc8>

In [42]:
#delete all documents - empties the restaurants collection
db.restaurants.delete_many({})

<pymongo.results.DeleteResult at 0x17acf394688>
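
The `DeleteResult` objects echoed above are more informative once their `deleted_count` attribute is read. The sketch below shows that pattern; `delete_and_count` is a hypothetical helper, and `StubCollection` is a hypothetical in-memory stand-in (supporting only exact-match queries) so the snippet runs without a MongoDB server.

```python
from types import SimpleNamespace

def delete_and_count(collection, query):
    """Run delete_many and return how many documents were removed."""
    result = collection.delete_many(query)
    return result.deleted_count

class StubCollection:
    """Hypothetical in-memory stand-in for a pymongo collection."""

    def __init__(self, documents):
        self.documents = list(documents)

    def delete_many(self, query):
        # Keep documents that fail to match at least one key/value in the query;
        # an empty query matches (and therefore deletes) everything.
        keep = [d for d in self.documents
                if not all(d.get(k) == v for k, v in query.items())]
        removed = len(self.documents) - len(keep)
        self.documents = keep
        return SimpleNamespace(deleted_count=removed)

stub = StubCollection([
    {"name": "India Garden"},
    {"name": "India Garden"},
    {"name": "Vella"},
])
removed = delete_and_count(stub, {"name": "India Garden"})
print(removed)
```

With a real collection, the same helper could be called as `delete_and_count(db.restaurants, {"name": "India Garden"})`.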

In [43]:
#find the first 10 documents in the collection - none should be found since the data was deleted
cursor = db.restaurants.find()
printhead(cursor, 10)

### Explore Further

https://docs.mongodb.com/manual/core/bulk-write-operations/

https://docs.mongodb.com/manual/reference/sql-comparison/