## Tutorial Overview

This tutorial gives a brief introduction to non-relational (NoSQL) databases. Specifically, it focuses on MongoDB, a document-oriented NoSQL database, and walks through a simple example of interacting with MongoDB from Python using the [PyMongo](https://api.mongodb.com/python/current/) library.
We will cover the following topics in this tutorial:
- [Introduction to NoSQL and MongoDB](#Introduction-to-NoSQL-and-MongoDB)
- [Getting Started](#Getting-Started)
- [Basic Operations](#Basic-Operations)
    - [Insert Document](#Insert-Document)
    - [Update Document](#Update-Document)
    - [Find Document](#Find-Document)
    - [Aggregate Document](#Aggregate-Document)
    - [MapReduce Document](#MapReduce-Document)
    - [Delete Document](#Delete-Document)
- [Summary and References](#Summary-and-References)

## Introduction to NoSQL and MongoDB

### Non-relational Database (NoSQL)
Traditional relational database management systems (RDBMS) impose a strict schema: table structures and data types must be defined when the tables are created. While an RDBMS can efficiently handle organized, structured data, it cannot deal as effectively with data that are less organized. More importantly, it is much harder to scale out to clusters of machines for big data and real-time web applications.

Non-relational databases are designed to address these limitations. Most NoSQL databases trade some consistency for availability and partition tolerance, and they typically do not enforce a schema when the database is created. Each item is stored under a unique attribute name (or 'key') together with its corresponding value, which could be a plain string, a set of columns, or semi-structured JSON or XML, depending on the type of NoSQL database. Since keys are unique within a collection (the counterpart of a table), it is easy to scale the database horizontally by partitioning the data with a hash function on the keys. Because of these traits, NoSQL databases are widely used to handle unstructured data (e.g. text, social media posts, videos).
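
As a rough illustration of hash-based partitioning on keys, the sketch below assigns a few keys to shards. The names `shard_for` and `NUM_SHARDS` and the example keys are made up for this illustration, and this is not how MongoDB itself implements sharding; it only shows the general idea.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size, chosen only for illustration

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Illustrative keys; any unique strings would do.
keys = ["30075445", "30112340", "30191841", "40356018"]
placement = {key: shard_for(key) for key in keys}
for key, shard in placement.items():
    print(f"key {key} -> shard {shard}")
```

Because the hash depends only on the key, any machine can work out which shard holds a given key without consulting a central index.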

### MongoDB
MongoDB is a scalable, high-performance, open-source, document-oriented NoSQL database. It is called document-oriented because it pairs each key with a JSON-like data structure known as a document. Each document can contain many different key-value or key-array pairs, and its structure can change over time.
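
To make the idea of a flexible document concrete, here is a small sketch using plain Python dictionaries. The documents and field names are invented for illustration, and a Python list stands in for a collection; no database is involved.

```python
import json

# Two documents that could live in the same collection.
# The second one has an extra "tags" array and no "address" field.
doc_a = {"_id": "r1", "name": "Example Bakery", "address": {"zipcode": "10462"}}
doc_b = {"_id": "r2", "name": "Example Diner", "tags": ["brunch", "late-night"]}

collection = [doc_a, doc_b]  # a plain list standing in for a collection
for doc in collection:
    print(json.dumps(doc))

# A field can be added to one document later without touching the others.
doc_a["phone"] = "555-0100"
shared_fields = sorted(set(doc_a) & set(doc_b))
print(shared_fields)
```

Only `_id` and `name` are common to both documents; everything else is free to differ from one document to the next.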

## Getting Started

To get started, we will first need to install MongoDB. On a Mac machine, we can simply install the MongoDB community edition using `Homebrew`:

    $ brew install mongodb

To install MongoDB on Linux or Windows, please follow the instructions at https://docs.mongodb.com/manual/administration/install-community

After installation, please continue following the instructions to set up the database environment. (__IMPORTANT__: you must create a folder named `/data/db` as the data directory.)

Next we can install the [PyMongo](https://api.mongodb.com/python/current/) library using `pip`:

    $ pip install pymongo
    
After finishing all the installations, make sure the following imports work for you:

In [1]:
import os 
from pymongo import MongoClient
from bson.json_util import loads
from bson.son import SON
from bson.code import Code
import json

We can now finally start interacting with MongoDB. First we will start a MongoDB server. Open a new terminal window and use the following command to start the server.

    $ mongod --dbpath <path-to-database-directory> # Example: mongod --dbpath ./data/db

Next, come back to this notebook and open a connection to the database.

In [2]:
mongo_url = 'mongodb://localhost:27017' # 27017 is MongoDB's default port

client = MongoClient(mongo_url) # connect to MongoDB
client

# Check the MongoDB server terminal to make sure you have established a connection
# You should expect to see something like '1 connection now open' in the terminal

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

The dataset we will be using is the __*restaurants*__ collection from the __*test*__ database. The dataset is available at https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json. Please get a copy of the dataset, name it `restaurants_dataset.json`, and store it in the current directory. This dataset stores basic location information about restaurants as well as review grades from customers, as shown below.


In [3]:
# take a quick look at the first document in the dataset

# a small helper so that datetime values are printed in ISO format
def date_handler(obj):
    return obj.isoformat() if hasattr(obj, 'isoformat') else obj

with open('restaurants_dataset.json') as f:
    first_line = loads(f.readline())
    print(json.dumps(first_line, indent=2, default=date_handler))

{
  "borough": "Bronx",
  "restaurant_id": "30075445",
  "name": "Morris Park Bake Shop",
  "cuisine": "Bakery",
  "address": {
    "street": "Morris Park Ave",
    "building": "1007",
    "coord": [
      -73.856077,
      40.848447
    ],
    "zipcode": "10462"
  },
  "grades": [
    {
      "grade": "A",
      "date": "2014-03-03T00:00:00+00:00",
      "score": 2
    },
    {
      "grade": "A",
      "date": "2013-09-11T00:00:00+00:00",
      "score": 6
    },
    {
      "grade": "A",
      "date": "2013-01-24T00:00:00+00:00",
      "score": 10
    },
    {
      "grade": "A",
      "date": "2011-11-23T00:00:00+00:00",
      "score": 9
    },
    {
      "grade": "B",
      "date": "2011-03-10T00:00:00+00:00",
      "score": 14
    }
  ]
}


## Basic Operations

### Insert Document
The first step is to load our JSON data into MongoDB. The counterpart of a table in a relational database is called a 'collection' in MongoDB, and each entry in a collection (a key paired with its JSON-like value) is known as a document. If `_id` is not specified in a document, MongoDB will automatically create an ObjectId as the key for that document, which will be unique within the collection. In our case, we will use `restaurant_id` as the unique identifier of each document.
You should expect to see 25359 records in the collection after loading all the data.

In [4]:
# each line in our data file represents a document
# we read the file line by line and insert each document into the database

# if a database or collection does not exist yet, it will be created automatically

my_db = client.demo_db
my_col = my_db.demo_collection

with open('restaurants_dataset.json') as f:
    for line in f:
        doc = loads(line) # parse one line of JSON
        doc['_id'] = doc['restaurant_id'] # set _id as restaurant_id
        my_col.insert_one(doc) # insert into my_col collection

# count_documents({}) replaces Collection.count(), which newer PyMongo versions removed
print("total number of documents in the collection: "
      + str(my_col.count_documents({})))


total number of documents in the collection: 25359
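
Inserting one document at a time means one round trip to the server per document. PyMongo collections also offer `insert_many`, which takes a list of documents. The sketch below shows one way to batch the inserts; `insert_in_batches` and `FakeCollection` are hypothetical names introduced here, and the fake collection only records batch sizes so that the example runs without a server.

```python
from itertools import islice

def insert_in_batches(collection, docs, batch_size=1000):
    """Insert documents in batches via collection.insert_many.

    Returns the number of documents passed to insert_many.
    """
    iterator = iter(docs)
    inserted = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted

class FakeCollection:
    """Hypothetical stand-in that records batch sizes; not part of PyMongo."""
    def __init__(self):
        self.batch_sizes = []

    def insert_many(self, batch):
        self.batch_sizes.append(len(batch))

fake = FakeCollection()
total = insert_in_batches(fake, ({"_id": str(i)} for i in range(2500)))
print(total, fake.batch_sizes)
```

With a real connection, `my_col` from the cell above could presumably be passed in place of `fake`, with the parsed lines of the dataset as `docs`.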


### Update Document

As mentioned above, one of the biggest advantages of NoSQL is its flexibility in handling semi-structured data. Suppose a new customer has just visited the Morris Park Bake Shop (the example above) and left a review (in this case a bad one). We can simply update the document by adding a new element to its `grades` array. Notice that the new element has a `comment` field and no `date` field, but MongoDB has no problem handling this semi-structured data.

In [5]:
new_review = {"score":1, "grade":"C", "comment":"This is too expensive!"}
restaurant_id = "30075445" # _id for Morris Park Bake Shop

result = client.demo_db.demo_collection.update_one({'_id': restaurant_id}, \
            {"$addToSet": {"grades": new_review}}, upsert=True)

print(result.raw_result)


{'ok': 1.0, 'n': 1, 'nModified': 1, 'updatedExisting': True}
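
The update above uses `$addToSet`, which adds a value to an array only if an equal value is not already there, whereas `$push` always appends. The following is a plain-Python imitation of that difference; `add_to_set` and `push` are illustrative helpers, not PyMongo functions, and Python's dictionary equality is only an approximation of how MongoDB compares embedded documents.

```python
def add_to_set(array, value):
    """Append value only if an equal element is not already present."""
    if value not in array:
        array.append(value)

def push(array, value):
    """Always append value."""
    array.append(value)

review = {"score": 1, "grade": "C", "comment": "This is too expensive!"}

grades_set = []
add_to_set(grades_set, review)
add_to_set(grades_set, dict(review))  # same content again: ignored

grades_push = []
push(grades_push, review)
push(grades_push, dict(review))       # same content again: appended

print(len(grades_set), len(grades_push))
```

This is why re-running the update cell with the same review should not create a duplicate entry in `grades`.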


If we want to update a value inside an array element (here, the score of the first element in the `grades` field):

In [6]:
restaurant_id = "30075445" # _id for Morris Park Bake Shop

result = client.demo_db.demo_collection.update_one({'_id': restaurant_id}, \
            {"$set": {"grades.0.score": 5}})
print(result.raw_result)

{'ok': 1.0, 'n': 1, 'nModified': 1, 'updatedExisting': True}
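
The key `"grades.0.score"` uses dot notation: each part of the path is either a field name or an array index. As a sketch of how such a path can be resolved, the hypothetical helper `set_by_path` below walks a nested Python structure in the same way. It only illustrates the notation and is not MongoDB's implementation.

```python
def set_by_path(doc, path, value):
    """Set a value in nested dicts/lists using a dotted path such as 'grades.0.score'."""
    parts = path.split(".")
    target = doc
    # Walk down to the container that holds the final field.
    for part in parts[:-1]:
        target = target[int(part)] if isinstance(target, list) else target[part]
    last = parts[-1]
    if isinstance(target, list):
        target[int(last)] = value
    else:
        target[last] = value

# A trimmed-down document with two grade entries.
doc = {"grades": [{"grade": "A", "score": 2}, {"grade": "A", "score": 6}]}
set_by_path(doc, "grades.0.score", 5)
print(doc["grades"][0])
```

Only the first element of `grades` changes; the second keeps its original score.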


### Find Document

Next, we query the database. Although the syntax is a little different from that of a relational database, the basic idea is the same.
- If we want to see the document we just updated, we can search by key (since we know its restaurant id) and see the newly inserted grade in the document.

In [7]:
result = client.demo_db.demo_collection.find_one({"_id":"30075445"})
print(json.dumps(result, indent=2, default=date_handler))

{
  "borough": "Bronx",
  "restaurant_id": "30075445",
  "_id": "30075445",
  "name": "Morris Park Bake Shop",
  "cuisine": "Bakery",
  "address": {
    "street": "Morris Park Ave",
    "building": "1007",
    "coord": [
      -73.856077,
      40.848447
    ],
    "zipcode": "10462"
  },
  "grades": [
    {
      "grade": "A",
      "date": "2014-03-03T00:00:00",
      "score": 5
    },
    {
      "grade": "A",
      "date": "2013-09-11T00:00:00",
      "score": 6
    },
    {
      "grade": "A",
      "date": "2013-01-24T00:00:00",
      "score": 10
    },
    {
      "grade": "A",
      "date": "2011-11-23T00:00:00",
      "score": 9
    },
    {
      "grade": "B",
      "date": "2011-03-10T00:00:00",
      "score": 14
    },
    {
      "grade": "C",
      "score": 1,
      "comment": "This is too expensive!"
    }
  ]
}


- We can also find this restaurant by name, but that might return multiple restaurants that happen to have the same name (although not in this case).

In [8]:
query = {"name":"Morris Park Bake Shop"}
result = client.demo_db.demo_collection.find(query)
# Cursor.count() was removed in newer PyMongo versions; count on the collection instead
print("Matched documents: "
      + str(client.demo_db.demo_collection.count_documents(query)))

Matched documents: 1


- If we are in the Bronx around lunch time and want to find a restaurant in the area:

In [9]:
query = {"borough":"Bronx"}
result = client.demo_db.demo_collection.find(query)
print("Matched documents: "
      + str(client.demo_db.demo_collection.count_documents(query)))

Matched documents: 2338


- If we are in the Bronx and only interested in bakeries:

In [10]:
query = {"borough":"Bronx", "cuisine":"Bakery"}
result = client.demo_db.demo_collection.find(query)
print("Matched documents: "
      + str(client.demo_db.demo_collection.count_documents(query)))

Matched documents: 71


- If we are in the Bronx and not interested in bakeries:

In [11]:
query = {"borough":"Bronx", "cuisine":{"$ne":"Bakery"}}
result = client.demo_db.demo_collection.find(query)
print("Matched documents: "
      + str(client.demo_db.demo_collection.count_documents(query)))

Matched documents: 2267


- If we want to find restaurants whose cuisine is either Bakery or American:

In [12]:
query = {"cuisine":{"$in": ["Bakery", "American"]}}
result = client.demo_db.demo_collection.find(query)
print("Matched documents: "
      + str(client.demo_db.demo_collection.count_documents(query)))

Matched documents: 6874


- If we want to do an OR query on different fields:

In [13]:
query = {"$or": [{"cuisine":"Bakery"}, {"borough":"Bronx"}]}
result = client.demo_db.demo_collection.find(query)
print("Matched documents: "
      + str(client.demo_db.demo_collection.count_documents(query)))

Matched documents: 2958
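
The counts reported so far can be cross-checked with simple set arithmetic. The sketch below uses only numbers printed in this tutorial; the total number of bakeries (1 + 690) is taken from the deletion step in the Delete Document section.

```python
# Counts reported by the queries in this tutorial.
bronx = 2338            # borough == "Bronx"
bronx_bakery = 71       # borough == "Bronx" and cuisine == "Bakery"
bakery = 1 + 690        # delete_one + delete_many counts for cuisine == "Bakery"

# Bronx restaurants that are not bakeries (the $ne query).
bronx_not_bakery = bronx - bronx_bakery

# Inclusion-exclusion: |Bakery or Bronx| = |Bakery| + |Bronx| - |both|.
bakery_or_bronx = bakery + bronx - bronx_bakery

print(bronx_not_bakery)
print(bakery_or_bronx)
```

Both values agree with the `$ne` and `$or` results shown above (2267 and 2958).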


- If we want to retrieve documents in a specific range, for example restaurants whose zip code starts with 104:

In [14]:
query = {"address.zipcode":{"$gt": "10400", "$lt": "10500"}}
result = client.demo_db.demo_collection.find(query)
print("Matched documents: "
      + str(client.demo_db.demo_collection.count_documents(query)))

Matched documents: 2353
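
Since the zip codes are stored as strings, the range above is a string comparison rather than a numeric one. The sketch below imitates the two conditions with Python's string ordering, which should behave the same way for these digit-only values; `in_range` is a hypothetical helper and the zip codes are just examples.

```python
def in_range(zipcode, low="10400", high="10500"):
    """Mimic {"$gt": low, "$lt": high} on string values (lexicographic order)."""
    return low < zipcode < high

print(in_range("10462"))  # inside the 104xx range
print(in_range("10400"))  # excluded, because $gt is strict
print(in_range("1045"))   # a shorter string still sorts inside the range
print(in_range("11201"))  # outside the range
```

Note that a zip code of exactly `"10400"` is not matched by `$gt`; `$gte` would include it.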


Note that if we query a field that contains an array using multiple conditional operators, the field as a whole will match if either a single array element meets all the conditions or a combination of array elements together meets the conditions.
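
The difference between "any combination of elements" and "one element satisfying everything" can be imitated in plain Python. In MongoDB the stricter behaviour is requested with the `$elemMatch` operator; the two helper functions and the scores below are made up for illustration.

```python
def matches_any_combination(values, low, high):
    """Each condition may be satisfied by a different array element."""
    return any(v > low for v in values) and any(v < high for v in values)

def matches_single_element(values, low, high):
    """One element must satisfy every condition (the $elemMatch behaviour)."""
    return any(low < v < high for v in values)

scores = [5, 40]  # hypothetical scores; no single value lies between 10 and 20
print(matches_any_combination(scores, 10, 20))
print(matches_single_element(scores, 10, 20))
```

Here 40 satisfies "greater than 10" and 5 satisfies "less than 20", so the combined form matches even though no single score lies in the range.
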
- If we want to do an exact-match query on an embedded document (or array), for instance to find a restaurant by its coordinates, the following query does the job.

In [15]:
# correct coordinates
query = {"address.coord": [-73.856077, 40.848447]}
result = client.demo_db.demo_collection.find(query)
print("Matched documents: "
      + str(client.demo_db.demo_collection.count_documents(query)))

# if we flip the coordinates
query = {"address.coord": [40.848447, -73.856077]}
result = client.demo_db.demo_collection.find(query)
print("Matched documents: "
      + str(client.demo_db.demo_collection.count_documents(query)))

Matched documents: 1
Matched documents: 0


In this case, the array in the query must match the field exactly, including the order of its elements, which makes sense for coordinates.
- But if we want to treat the array as a set and ignore the order:

In [16]:
# correct coordinates
query = {"address.coord": {"$all": [-73.856077, 40.848447]}}
result = client.demo_db.demo_collection.find(query)
print("Matched documents: "
      + str(client.demo_db.demo_collection.count_documents(query)))

# if we flip the coordinates
query = {"address.coord": {"$all": [40.848447, -73.856077]}}
result = client.demo_db.demo_collection.find(query)
print("Matched documents: "
      + str(client.demo_db.demo_collection.count_documents(query)))

Matched documents: 1
Matched documents: 1
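
The two matching modes can be mirrored in plain Python: an exact array match behaves like list equality, while `$all` only asks that every queried value appears somewhere in the array. The helpers `exact_match` and `all_match` are illustrative names, not part of PyMongo.

```python
def exact_match(field, query):
    """Same elements in the same order, like querying with a literal array."""
    return field == query

def all_match(field, query):
    """Every queried value is present, in any position, like $all."""
    return all(item in field for item in query)

coord = [-73.856077, 40.848447]
flipped = [40.848447, -73.856077]

print(exact_match(coord, coord), exact_match(coord, flipped))
print(all_match(coord, coord), all_match(coord, flipped))
```

One consequence of the `$all` form is that an array with additional elements would also match, as long as it contains all the queried values.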


### Aggregate Document
Of course, in MongoDB we can also perform aggregations on documents, much like the GROUP BY operation in a relational database.
For example, to count the restaurants for each cuisine, sorted by frequency:

In [17]:
pipeline = [
        {"$unwind": "$cuisine"},
        {"$group": {"_id": "$cuisine", "count": {"$sum": 1}}},
        {"$sort": SON([("count", -1)])}]

result = client.demo_db.demo_collection.aggregate(pipeline)

count = 0
for r in result:
    if(count >= 5):
        break
    print(r)
    count += 1

{'count': 6183, '_id': 'American'}
{'count': 2418, '_id': 'Chinese'}
{'count': 1214, '_id': 'Café/Coffee/Tea'}
{'count': 1163, '_id': 'Pizza'}
{'count': 1069, '_id': 'Italian'}
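
For intuition, the `$group` and `$sort` stages do roughly what `collections.Counter` does in plain Python. The sketch below runs on a handful of made-up documents rather than on the real collection, so the counts are illustrative only.

```python
from collections import Counter

# A few made-up documents with only the field we group on.
docs = [
    {"cuisine": "American"},
    {"cuisine": "Bakery"},
    {"cuisine": "American"},
    {"cuisine": "Chinese"},
    {"cuisine": "American"},
    {"cuisine": "Chinese"},
]

# Group by cuisine and count, then keep the two most frequent groups.
counts = Counter(doc["cuisine"] for doc in docs)
top = [{"_id": cuisine, "count": n} for cuisine, n in counts.most_common(2)]
print(top)
```

In the pipeline itself, a `{"$limit": 5}` stage after `$sort` could take the place of the manual loop counter used above.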


### MapReduce Document
Another even cooler feature of MongoDB is that you can use the MapReduce framework for aggregation. We can define our own map and reduce functions and apply them to the data. For example, suppose we want to calculate the average score for each grade level. The mapper emits a (grade, score) pair for each grade entry, and the reducer groups the pairs that share the same grade and calculates their average score.

In [18]:
mapper = Code("""
    function() {
        this.grades.forEach(function(g) {
            emit(g.grade, g.score);
        });
    }""")

reducer = Code("""
               function (key, values) {
                  var total = 0;
                  var count = 0;
                  for (var i = 0; i < values.length; i++) {
                    total += values[i];
                    count += 1
                  }
                  return total / count;
                }
                """)

# result will be stored in the 'demo_output' collection
result = client.demo_db.demo_collection.map_reduce(
    mapper, reducer, "demo_output")

for doc in result.find():
    print(doc)

{'_id': 'A', 'value': 4.310672030330556}
{'_id': 'B', 'value': 16.029527949429514}
{'_id': 'C', 'value': 33.150695668621594}
{'_id': 'Not Yet Graded', 'value': 18.07369160359854}
{'_id': 'P', 'value': 4.801638104421059}
{'_id': 'Z', 'value': 25.672659777312138}


Once you have the average score result, you can tell that this is not a real-world dataset but you get the idea :)
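
One caveat worth knowing: MongoDB may call the reduce function more than once for the same key, feeding it earlier reduce outputs, so a reducer that returns an average can end up averaging averages. Whether the numbers above are affected depends on how the server split the work. A common alternative is to carry (total, count) pairs and divide only at the end. The sketch below shows that idea in plain Python with made-up documents; `map_grades` and `reduce_pairs` are hypothetical names.

```python
from collections import defaultdict

def map_grades(doc):
    """Emit (grade, (score_total, count)) pairs for one document."""
    for g in doc["grades"]:
        yield g["grade"], (g["score"], 1)

def reduce_pairs(values):
    """Combine (total, count) pairs; safe to apply to partial results."""
    total = sum(t for t, _ in values)
    count = sum(c for _, c in values)
    return (total, count)

docs = [
    {"grades": [{"grade": "A", "score": 2}, {"grade": "A", "score": 6}]},
    {"grades": [{"grade": "A", "score": 10}, {"grade": "B", "score": 14}]},
]

grouped = defaultdict(list)
for doc in docs:
    for key, value in map_grades(doc):
        grouped[key].append(value)

# Reduce in two steps to imitate a re-reduce over partial results.
partial = reduce_pairs(grouped["A"][:2])
total, count = reduce_pairs([partial] + grouped["A"][2:])
average_a = total / count

# Averaging partial averages instead gives a different number.
naive = ((2 + 6) / 2 + 10) / 2

print(average_a, naive)
```

The (total, count) version gives 6.0 for grade A however the values are split, while the average of averages gives 7.0 for this particular split.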

### Delete Document
Finally, if we want to remove some documents from the collection, for example all Bakery restaurants, we can do the following:

In [19]:
# remove only the first matching document from the collection
drop_results = client.demo_db.demo_collection.delete_one({'cuisine':'Bakery'})
print("Deleted documents using delete_one: "+ str(drop_results.deleted_count))

# remove all matching documents from the collection
drop_results = client.demo_db.demo_collection.delete_many({'cuisine':'Bakery'})
print("Deleted documents using delete_many: "+ str(drop_results.deleted_count))

Deleted documents using delete_one: 1
Deleted documents using delete_many: 690


If we want to drop the entire collection, simply use:

In [20]:
# list_collection_names() replaces collection_names(), which newer PyMongo versions removed
print("Available collection before dropping: ")
cols = client.demo_db.list_collection_names()
for c in cols:
    print(c)
print("")

client.demo_db.demo_collection.drop()
client.demo_db.demo_output.drop()

print("Available collection after dropping: ")
cols = client.demo_db.list_collection_names()
for c in cols:
    print(c)
print("")

Available collection before dropping: 
demo_output
demo_collection

Available collection after dropping: 



At the end, you should shut down the MongoDB server by pressing Ctrl+C or by simply closing the terminal the server is running in.

## Summary and References

This tutorial only covers the basic functionality of MongoDB and how it handles semi-structured data. It has not covered another important feature of NoSQL: scalability. I have provided some readings about sharding in the references below.

1. MongoDB - https://www.mongodb.com
2. PyMongo - https://api.mongodb.com/python/current/
3. High Scalability With MongoDB Sharding - https://dzone.com/articles/divide-and-conquer-high-scalability-with-mongodb-t
4. Restaurant Dataset - https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
5. MongoDB Source Code - https://github.com/mongodb/mongo
6. MongoDB Operator Cheatsheet - https://blog.codecentric.de/files/2012/12/MongoDB-CheatSheet-v1_0.pdf