# NoSQL databases in Python: Use MongoDB and `mongoengine`

## Resources

- https://realpython.com/introduction-to-mongodb-and-python/#using-mongodb-with-python-and-mongoengine
- https://docs.mongoengine.org/tutorial.html
- https://medium.com/swlh/setting-up-mongodb-in-command-line-c41874f2f9c0
- https://docs.mongodb.com/guides/server/install/
- https://muralidba.blogspot.com/2018/04/files-in-wiredtiger-database.html

## Installation

## Set up MongoDB database (`mongodb`)

In your terminal:

```bash
mkdir -p some/dir/for/the/db
mongod --dbpath some/dir/for/the/db
```

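This starts a MongoDB server that stores its data under the given directory, creating the data files from scratch or reusing existing ones. The `mongoengine` package connects to this server, so leave `mongod` running in this terminal while you work through the steps below.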

## Quick start

```python
from mongoengine import connect
from mongoengine import Document, ListField, StringField, URLField

# Connect to the local MongoDB server started above.
connect(db="rptutorials", host="localhost", port=27017)

# Define the document schema.
class Tutorial(Document):
    title = StringField(required=True, max_length=70)
    author = StringField(required=True, max_length=20)
    contributors = ListField(StringField(max_length=20))
    url = URLField(required=True)

# Create one document and write it to the database.
tutorial1 = Tutorial(
    title="Beautiful Soup: Build a Web Scraper With Python",
    author="Martin",
    contributors=["Aldren", "Geir Arne", "Jaya", "Joanna", "Mike"],
    url="https://realpython.com/beautiful-soup-web-scraper-python/"
)
tutorial1.save()
```

```
<Tutorial: Tutorial object>
```

## Connect to database (`mongoengine`)

`connect()` registers a connection to the named database. If the database does not exist yet, MongoDB creates it on the first write.

```python
from mongoengine import connect, disconnect

connect(db="rptutorials", host="localhost", port=27017)
```

```
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary())
```

## Define database schema using documents (`mongoengine`)

> In MongoDB, a document is roughly equivalent to a row in an RDBMS. When working with relational databases, rows are stored in tables, which have a strict schema that the rows follow. MongoDB stores documents in collections rather than tables — the principal difference is that no schema is enforced at a database level.

From https://docs.mongoengine.org/guide/defining-documents.html#defining-a-document-s-schema
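
To make the "no schema is enforced at a database level" point concrete, here is a minimal sketch that uses plain Python dictionaries instead of a live MongoDB server. The `collection` list, `REQUIRED_TUTORIAL_FIELDS`, and `missing_required` are hypothetical stand-ins invented for this illustration and are not part of `mongoengine`; they only show that any schema check has to live in application code.

```python
# Plain-Python stand-in for a collection: documents are just dicts,
# and nothing forces them to share the same fields.
collection = [
    {"title": "Beautiful Soup: Build a Web Scraper With Python", "author": "Martin"},
    {"email": "someone@example.com"},  # different shape, same "collection"
]

# Hypothetical application-level schema: the fields a tutorial must have.
REQUIRED_TUTORIAL_FIELDS = {"title", "author", "url"}

def missing_required(document):
    """Return the required tutorial fields absent from a document, sorted."""
    return sorted(REQUIRED_TUTORIAL_FIELDS - set(document))

for document in collection:
    print(missing_required(document))
```

Both dictionaries sit in the same list without complaint; only the application-level check reports which fields are missing. Declaring fields on a `Document` subclass plays the same role on the application side.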

```python
from mongoengine import Document, ListField, StringField, URLField

class Tutorial(Document):
    title = StringField(required=True, max_length=70)
    author = StringField(required=True, max_length=20)
    contributors = ListField(StringField(max_length=20))
    url = URLField(required=True)

class User(Document):
    email = StringField(required=True)
    first_name = StringField(max_length=50)
    last_name = StringField(max_length=50)
```

## Populate database with documents (`mongoengine`)

```python
def create_tutorials(n):
    """Build n identical, unsaved Tutorial documents."""
    tutorials = [
        Tutorial(
            title="Beautiful Soup: Build a Web Scraper With Python",
            author="Martin",
            contributors=["Aldren", "Geir Arne", "Jaya", "Joanna", "Mike"],
            url="https://realpython.com/beautiful-soup-web-scraper-python/"
        )
        for _ in range(n)
    ]
    return tutorials
```

```python
tutorials = create_tutorials(40000)
# Write the documents one at a time.
for tutorial in tutorials:
    tutorial.save()
```

## Access data in database (`mongoengine`)

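The counts below are far larger than the number of documents created above, which suggests the database already held documents from earlier runs.

```python
Tutorial.objects.count()
```

```
5217072
```

```python
Tutorial.objects(author="Alex").count()
```

```
0
```

```python
User.objects.count()
```

```
6
```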

## Populate database again faster

### Bulk insert faster?

```python
n_docs = 40000
```

#### Use `.save()`

```python
tutorials = create_tutorials(n_docs)
print(len(tutorials))
%time a = [t.save() for t in tutorials]
del a
```

```
40000
CPU times: user 9.89 s, sys: 476 ms, total: 10.4 s
Wall time: 12.5 s
```


#### Use `.insert()`

```python
tutorials = create_tutorials(n_docs)
print(len(tutorials))
%time a = Tutorial.objects.insert(tutorials)
del a
```

```
40000
CPU times: user 2.23 s, sys: 32.5 ms, total: 2.26 s
Wall time: 2.41 s
```


#### Result

_Note_: in this run, `.insert()` was about 5 times faster than `.save()` in wall time (2.41 s vs. 12.5 s for 40,000 documents).
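
As a rough check, the speedup can be computed directly from the timings reported by the two `%time` runs. The variable names below are introduced only for this sketch; the numbers are copied from the outputs above.

```python
# Timings reported by %time above, for 40,000 documents each.
save_wall_s = 12.5     # .save() wall time
insert_wall_s = 2.41   # .insert() wall time
save_cpu_s = 10.4      # .save() total CPU time
insert_cpu_s = 2.26    # .insert() total CPU time

wall_speedup = save_wall_s / insert_wall_s
cpu_speedup = save_cpu_s / insert_cpu_s

print(f"wall-time speedup: {wall_speedup:.1f}x")  # prints 5.2x
print(f"CPU-time speedup:  {cpu_speedup:.1f}x")   # prints 4.6x
```

These are single measurements, so the ratios are indicative rather than a controlled benchmark.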

### Parallelization?

```python
from multiprocessing import Pool
```

#### Use `.save()`

```python
Tutorial.objects.count()
```

```
5297072
```

```python
def save_document(document):
    # Each worker process needs its own connection to the server.
    connect(db="rptutorials", host="localhost", port=27017)
    document.save()

n_cores = 4
n_docs = 400000
tutorials = create_tutorials(n_docs)

# Close the parent's connection before the pool starts its worker processes.
disconnect()

pool = Pool(processes=n_cores)
%time pool.map(save_document, tutorials)
pool.close()
pool.join()

# Reconnect in the parent process for the queries that follow.
connect(db="rptutorials", host="localhost", port=27017)
```

```
CPU times: user 11.5 s, sys: 153 ms, total: 11.7 s
Wall time: 48.3 s

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary())
```

```python
Tutorial.objects.count()
```

```
5697072
```

#### Use `.insert()`

```python
import itertools

def chunked_iterable(iterable, size):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, size))
        if not chunk:
            break
        yield chunk
```
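
To see what `chunked_iterable` produces, the sketch below repeats the helper and runs it on a small range instead of a list of documents. The input `range(10)` and the chunk size of 4 are arbitrary values chosen for illustration.

```python
import itertools

def chunked_iterable(iterable, size):
    """Yield successive tuples of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, size))
        if not chunk:
            break
        yield chunk

chunks = list(chunked_iterable(range(10), 4))
print(chunks)  # prints [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9)]
```

The last chunk is shorter when the input length is not a multiple of the chunk size, and an empty input yields no chunks at all.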

```python
n_cores = 4  # not used below: the chunks are inserted one after another
n_docs = 400000

tutorials = create_tutorials(n_docs)
# Insert the documents in chunks of 100,000.
%time a = [Tutorial.objects.insert(tutorials_chunk) for tutorials_chunk in chunked_iterable(tutorials, 100000)]
del a
```

```
CPU times: user 22.2 s, sys: 336 ms, total: 22.6 s
Wall time: 24 s
```


```python
# Building the queryset is lazy, so this does not query the database yet.
%time a = Tutorial.objects
del a
```

```
CPU times: user 39 µs, sys: 1 µs, total: 40 µs
Wall time: 42.9 µs
```


```python
%time Tutorial.objects.count()
```

```
CPU times: user 1.25 ms, sys: 0 ns, total: 1.25 ms
Wall time: 1.83 s

6097072
```

```python
tutorials = create_tutorials(n_docs)
%time a = Tutorial.objects.insert(tutorials)
del a
Tutorial.objects.count()
```

```
CPU times: user 21.7 s, sys: 368 ms, total: 22 s
Wall time: 23.6 s

6497072
```
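
To compare the runs above on a common scale, the following sketch converts each reported wall time into documents per second. The labels are informal descriptions added here; the document counts and wall times are copied from the outputs above.

```python
# (label, number of documents, wall time in seconds) for each timed run above.
runs = [
    ("save, sequential", 40_000, 12.5),
    ("insert, single call", 40_000, 2.41),
    ("save, 4-process pool", 400_000, 48.3),
    ("insert, 100,000-document chunks", 400_000, 24.0),
    ("insert, single call", 400_000, 23.6),
]

# Throughput in documents per second for each run.
throughputs = [(label, n_docs / wall_s) for label, n_docs, wall_s in runs]

for label, docs_per_s in throughputs:
    print(f"{label:<32} {docs_per_s:>8.0f} docs/s")
```

In these measurements the process pool improved on sequential `.save()`, but a plain `.insert()` still reached roughly twice the pool's throughput, and chunking the insert made little difference. The runs used different batch sizes against a collection that kept growing, so treat the rates as indicative only.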