# Introduction to Big Data Modern Technologies course

## TOPIC 2: NoSQL databases practice - MongoDB
### Part 1

### 1. Libraries

In [None]:
import os
import sys
import json
import pymongo

The Python library [PyMongo](https://pymongo.readthedocs.io) is used to access the MongoDB demo database. The first step when working with PyMongo is to create a `MongoClient` connected to the running `mongod` instance:

In [None]:
from pymongo import MongoClient

client = MongoClient()  # default settings: localhost, port 27017
client

### 2. MongoClient

In [None]:
client.server_info()

In [None]:
client.HOST

In [None]:
client.nodes

In [None]:
client.list_database_names()

### 3. Databases and collections

In [None]:
db = client['testdb']
db

A collection is a group of documents stored in MongoDB and can be thought of as roughly the equivalent of a table in a relational database. Here is an example of how to get a collection:

In [None]:
collection = db['test_collection']
collection

In [None]:
db.list_collection_names()

### 4. Working with documents

#### 4.1. Look at the data first

We use data about the University's teachers taken from the SPBU website.

In [None]:
!ls -la ~/__DATA/IBDT_Spring_2023/topic_2

In [None]:
file_path = '/home/jovyan/__DATA/IBDT_Spring_2023/topic_2/teachers-20230203.json'
with open(file_path, 'r', encoding='utf-8') as file:  # assumes the JSON file is UTF-8 encoded
    data = json.load(file)

In [None]:
type(data)

In [None]:
data[0]

In [None]:
print('total records:', len(data))

In [None]:
data[-1]

#### 4.2. Insert a single document

Data in MongoDB is represented (and stored) as JSON-style documents, which PyMongo maps to Python dictionaries. Let's look at a single record from our data first:

In [None]:
one_doc = data[0]
one_doc

In [None]:
type(one_doc)

In [None]:
one_doc.keys()

In [None]:
one_doc['Id']

In [None]:
one_doc['Employments']

To insert a document into the `teachers` collection we can use the `insert_one()` method:

In [None]:
result = db['teachers'].insert_one(one_doc)
result

In [None]:
# was the insert successful?
result.acknowledged

In [None]:
# id of the new document in the collection
result.inserted_id

We can check that the collection now exists in our database (MongoDB creates collections lazily, on the first insert):

In [None]:
db.list_collection_names()

Now we can retrieve the inserted document with the `find_one()` method:

In [None]:
db['teachers'].find_one()

In [None]:
db['teachers'].count_documents({})

#### 4.3. Insert many documents at once

In [None]:
len(data)

In [None]:
data[:3]

We can also perform bulk inserts with `insert_many()`:

In [None]:
# this will give us an error
# guess why?
result = db['teachers'].insert_many(data)
result.inserted_ids

In [None]:
# this will do the job
result = db['teachers'].insert_many(data[1:])
result.inserted_ids

In [None]:
len(result.inserted_ids)

#### 4.5. Basic manipulations

Display the first records in the collection:

In [None]:
count = 0

# NOTE: `find()` returns a lazy cursor over all matching documents,
# so we stop iterating manually after a few records

for teacher in db['teachers'].find():
    print(count)
    print(teacher)
    print('---' * 5)
    count += 1
    if count > 5: break
print('the end')
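
The manual counter above can be replaced by a bounded iteration. The sketch below uses `itertools.islice`, which works on any iterable, including a PyMongo cursor. The helper name `first_n` is hypothetical, and a plain generator of invented records stands in for the cursor so that the snippet runs without a database.

```python
from itertools import islice


def first_n(iterable, n):
    """Return at most the first n items of any iterable (a cursor included)."""
    return list(islice(iterable, n))


# Stand-in for a cursor: a lazy generator of invented documents.
fake_cursor = ({"Id": i} for i in range(100))

for teacher in first_n(fake_cursor, 6):
    print(teacher)
    print('---' * 5)

# With a live connection, the server can do the limiting itself:
# for teacher in db['teachers'].find().limit(6): ...
```

Using `limit()` on the cursor is usually preferable with a real collection, because the server then sends only the requested documents.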

Or we can pass a filter to `find(...)` to select records by condition:

In [None]:
for teacher in db['teachers'].find({'FullName': 'Гаршин Василий Владимирович'}):
    print(teacher)

If we only want to know how many documents match a query, we can call `count_documents()` instead of running a full query:

In [None]:
db['teachers'].count_documents({})
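
`count_documents()` accepts the same filter document as `find()`, so it can count the matches for a condition rather than the whole collection. As a rough illustration of what a simple equality filter selects, here is a pure-Python sketch. `matches` and `count_matching` are hypothetical helpers, the sample records are invented, and only top-level scalar equality is modelled (real MongoDB filters also support operators, nested fields and arrays).

```python
def matches(doc, query):
    """True if every key in `query` is present in `doc` with an equal value."""
    return all(key in doc and doc[key] == value for key, value in query.items())


def count_matching(docs, query):
    """Count documents that satisfy a top-level equality filter."""
    return sum(1 for doc in docs if matches(doc, query))


# Invented sample records that reuse field names from the teachers data.
sample = [
    {"Id": 1, "FullName": "Teacher A"},
    {"Id": 2, "FullName": "Teacher B"},
    {"Id": 3, "FullName": "Teacher A"},
]

print(count_matching(sample, {}))                         # empty filter matches everything
print(count_matching(sample, {"FullName": "Teacher A"}))  # equality on one field

# With a live connection the equivalent call would be:
# db['teachers'].count_documents({'FullName': 'Гаршин Василий Владимирович'})
```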

#### 4.6. Full text search

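Full-text search is less straightforward: a text index has to be created on the searched field first. See the MongoDB documentation on [text indexes](https://www.mongodb.com/docs/manual/core/link-text-indexes/#std-label-text-search-on-premises).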

In [None]:
db['teachers'].create_index(
    [('FullName', pymongo.TEXT)], 
    name='search_index', 
    default_language='russian'
)

In [None]:
result = db.teachers.find( { "$text": { "$search": "Гаршин" } } )

In [None]:
list(result)
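
Text search results can also be ranked by relevance. The sketch below only assembles the arguments that PyMongo's `find()` and `sort()` would receive, so it runs without a database. `text_search_args` is a hypothetical helper, and the output field name `score` is an arbitrary choice.

```python
def text_search_args(term):
    """Build filter, projection and sort spec for a relevance-ranked text search."""
    meta_score = {"$meta": "textScore"}
    return {
        "filter": {"$text": {"$search": term}},
        "projection": {"score": meta_score},  # adds the relevance score to each result
        "sort": [("score", meta_score)],      # most relevant documents first
    }


args = text_search_args("Гаршин")
print(args["filter"])

# With a live connection and the text index created above:
# cursor = db.teachers.find(args["filter"], args["projection"]).sort(args["sort"])
# list(cursor)
```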

### 5. Data analysis with Mongo

Recommended resources for learning how to query and aggregate data in MongoDB:
- the official MongoDB [tutorial](https://www.mongodb.com/developer/languages/python/python-quickstart-aggregation/) on aggregation with Python
- some [recommendations](https://developer.ibm.com/tutorials/analyze-json-data-in-mongodb-with-python/) from IBM
- a nice [playground](https://mongoplayground.net/) to improve your skills and test your queries

In [None]:
data[:3]

In [None]:
# exact match on the full name

pipeline = [
    {
        "$match": {
            "FullName": "Гаршин Василий Владимирович"
        }
    }
]
result = db.teachers.aggregate(pipeline)
list(result)

In [None]:
# full text search

pipeline = [
    {
        "$match": { "$text": { "$search": "Гаршин" } }
    }
]
result = db.teachers.aggregate(pipeline)
list(result)

In [None]:
# sorted by full name, descending

pipeline = [
    {
        "$sort": {
            "FullName": pymongo.DESCENDING
        }
    }
]
result = db.teachers.aggregate(pipeline)
list(result)[:5]

In [None]:
# sorted, then limited to the first result

pipeline = [
    {
        "$sort": {
            "FullName": pymongo.DESCENDING
        }
    },
    {
        "$limit": 1
    }
]
result = db.teachers.aggregate(pipeline)
list(result)

In [None]:
# name duplicates: how many times each full name occurs

list(db.teachers.aggregate([
    {
        "$group": {
            "_id": "$FullName",
            "count": {
                "$sum": 1
            }
        }
    },
    {
        "$sort": {
            "count": pymongo.DESCENDING
        }
    }
]))
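
The grouping above lists every name, including those that occur only once. To keep only genuine duplicates, a `$match` stage on the computed `count` can follow the `$group` stage. A minimal sketch: `duplicate_names_pipeline` is a hypothetical helper, and `-1` is used in place of `pymongo.DESCENDING` (the two are equal) so the snippet needs no imports.

```python
def duplicate_names_pipeline(min_count=2):
    """Aggregation pipeline that keeps only names occurring at least `min_count` times."""
    return [
        {"$group": {"_id": "$FullName", "count": {"$sum": 1}}},
        {"$match": {"count": {"$gte": min_count}}},
        {"$sort": {"count": -1}},  # -1 is the value of pymongo.DESCENDING
    ]


pipeline = duplicate_names_pipeline()
for stage in pipeline:
    print(stage)

# With a live connection:
# list(db.teachers.aggregate(duplicate_names_pipeline()))
```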

In [None]:
# number of employment records per position

list(db.teachers.aggregate([
    {
        "$unwind": "$Employments"
    },
    {
        "$group": {
            "_id": "$Employments.Position",
            "count": {
                "$sum": 1
            }
        }
    }
]))
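
To make the `$unwind` + `$group` combination more concrete, here is a pure-Python sketch of the same computation. It assumes `Employments` is a list of dictionaries with a `Position` key, as the pipeline above suggests. The helper `count_positions` is hypothetical and the sample records are invented.

```python
from collections import Counter


def count_positions(teachers):
    """Emulate `$unwind: "$Employments"` followed by `$group` on Position."""
    counts = Counter()
    for teacher in teachers:
        # $unwind emits one document per array element and, by default,
        # drops documents whose array is missing, null or empty.
        for employment in teacher.get("Employments") or []:
            counts[employment.get("Position")] += 1
    return counts


# Invented sample records; only the field names come from the teachers data.
sample = [
    {"FullName": "Teacher A", "Employments": [{"Position": "Professor"},
                                              {"Position": "Lecturer"}]},
    {"FullName": "Teacher B", "Employments": [{"Position": "Professor"}]},
    {"FullName": "Teacher C", "Employments": []},
]

position_counts = count_positions(sample)
for position, count in position_counts.most_common():
    print(position, count)
```

Note that the counts are per employment record, not per teacher: a teacher holding two positions contributes to two groups.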

In [None]:
# top 10 positions by number of employment records

result = db.teachers.aggregate([
    {
        "$unwind": "$Employments"
    },
    {
        "$group": {
            "_id": "$Employments.Position",
            "count": {
                "$sum": 1
            }
        }
    },
    {
        "$sort": {
            "count": pymongo.DESCENDING
        }
    },
    {
        "$limit": 10
    }
])
result = list(result)
result

In [None]:
!pip install matplotlib

In [None]:
import matplotlib.pyplot as plt

labels = [x['_id'] for x in result]
sizes = [x['count'] for x in result]

plt.figure(figsize=(8, 8))
plt.pie(sizes, labels=labels)
plt.show()

### 6. Relations in non-relational MongoDB

In [None]:
one_doc = {
    'Id': 2168123, 
    'FullName': 'Гаршин Василий Владимирович', 
    'Course': "Introduction to Big Data Modern Technologies"
}

In [None]:
db['courses'].insert_one(one_doc)

In [None]:
db.list_collection_names()

In [None]:
# all documents with related data

pipeline = [
    {
        "$lookup": {
            "from": "courses", 
            "localField": "Id", 
            "foreignField": "Id", 
            "as": "related_courses",
        }
    }
]
result = db.teachers.aggregate(pipeline)
list(result)

In [None]:
# related data for a single teacher

pipeline = [
    {
        "$lookup": {
            "from": "courses", 
            "localField": "Id", 
            "foreignField": "Id", 
            "as": "related_courses",
        }
    },
    {
        "$match": {
            "FullName": "Гаршин Василий Владимирович"
        }
    }
]
result = db.teachers.aggregate(pipeline)
list(result)
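
`$lookup` behaves like a left outer join: every document from `teachers` is kept, and `related_courses` holds the matching documents from `courses`, or an empty list when there are none. The pure-Python sketch below illustrates this for scalar join keys only. The helper `lookup` is hypothetical and the sample records are invented.

```python
def lookup(left_docs, right_docs, local_field, foreign_field, as_field):
    """Emulate a basic `$lookup`: attach matching right-hand documents to each left one."""
    joined = []
    for left in left_docs:
        key = left.get(local_field)
        related = [right for right in right_docs if right.get(foreign_field) == key]
        # The left document is always kept, even when nothing matches.
        joined.append({**left, as_field: related})
    return joined


# Invented sample records that reuse field names from the notebook.
teachers = [
    {"Id": 1, "FullName": "Teacher A"},
    {"Id": 2, "FullName": "Teacher B"},
]
courses = [
    {"Id": 1, "Course": "Introduction to Big Data Modern Technologies"},
]

joined = lookup(teachers, courses, "Id", "Id", "related_courses")
for doc in joined:
    print(doc["FullName"], len(doc["related_courses"]))
```

When only one teacher is of interest, it is generally advisable to place the `$match` stage before `$lookup`, so that the join runs on fewer documents.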

### 7. Home assignment

In [None]:
!ls -la ~/__DATA/IBDT_Spring_2023/topic_2

Your home assignment for this part is:
1. Take the large file `events-2022-08-29-2023-02-06-2302022348.json` (about 67 MB, which is fine)
2. Load that file as a collection (the analogue of a table) into a MongoDB database (let's call that database `homedb`)
3. The data contains the SPBU timetable. Briefly describe the data you have loaded (what are the key fields? how many documents are there?). A detailed description is not required. __NOTE:__ the data itself is in Russian, but you do not need to know Russian to name the fields and uncover their purpose
4. Build aggregations to answer the questions below

Questions to answer using match and aggregation techniques:
- are there courses that last all day? (__HINT:__ look at the `AllDay` field and use `$match` to find the desired values)
- are there courses with no teachers? (__HINT:__ look at the `HasEducators` field)
- find the 5 busiest time intervals within a day (__HINT:__ "busiest" means that the most courses are scheduled in those intervals; use the `TimeIntervalString` field and the `$group` stage)

__NOTE:__ using JavaScript for the data loading pipeline into MongoDB would be a plus