# MongoDB

Based on https://docs.mongodb.com/getting-started 


MongoDB is a **NoSQL** open-source **document database**. It scales horizontally by replicating and partitioning (sharding) the data over multiple nodes, which can improve the reliability and scalability of the system.

A record in MongoDB is a **document**, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects or Python dictionaries. The values of fields may include other documents, arrays, and arrays of documents.

This is an example of a document:
```JSON
{
   "_id" : ObjectId("54c955492b7c8eb21818bd09"),
   "address" : {
      "street" : "2 Avenue",
      "zipcode" : "10075",
      "building" : "1480",
      "coord" : [ -73.9557413, 40.7720266 ]
   },
   "borough" : "Manhattan",
   "cuisine" : "Italian",
   "grades" : [
      {
         "date" : ISODate("2014-10-01T00:00:00Z"),
         "grade" : "A",
         "score" : 11
      },
      {
         "date" : ISODate("2014-01-16T00:00:00Z"),
         "grade" : "B",
         "score" : 17
      }
   ],
   "name" : "Vella",
   "restaurant_id" : "41704620"
}
```
In MongoDB, every document has a unique **_id** field that acts as a primary key. If you do not provide one yourself, MongoDB automatically adds a unique _id to the document.

MongoDB stores documents in **collections**. Collections are analogous to tables in relational databases. Unlike a table, however, a collection does not require its documents to have the same schema.
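
Because documents in one collection need not share a schema, code that reads them should tolerate missing fields. The following is a minimal sketch in plain Python (no database required) that uses a list of dictionaries as a stand-in for a collection; the sample data and the helper `lecturer_name` are illustrative assumptions:

```python
# Stand-in for a collection: plain dictionaries with different fields.
# The sample data is illustrative only.
collection = [
    {"title": "Data Science", "lecturer": {"name": "Markus Löcher"}, "semester": 1},
    {"title": "Master Thesis", "semester": 3},  # no "lecturer" field
]

def lecturer_name(document, default="n/a"):
    """Return the nested lecturer name, or a default if the field is missing."""
    return document.get("lecturer", {}).get("name", default)

names = [lecturer_name(doc) for doc in collection]
print(names)
```

Using `dict.get()` with a default avoids a `KeyError` for documents that lack the field.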

You can start a Docker container with MongoDB like this:
```bash
docker run -p 27017:27017 -d mongo:4.0-xenial
```

In production you must enable authentication with a username and password, but for local development this unauthenticated setup is fine.
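
As a rough sketch of what an authenticated setup could look like: the official `mongo` Docker image reads root credentials from the `MONGO_INITDB_ROOT_USERNAME` and `MONGO_INITDB_ROOT_PASSWORD` environment variables, and the client can then pass them in the connection URI. The user name, password, and the helper `build_mongo_uri` below are placeholders chosen for illustration. Special characters in the credentials have to be percent-encoded before they go into the URI.

```python
from urllib.parse import quote_plus

def build_mongo_uri(username, password, host="localhost", port=27017):
    """Build a MongoDB connection URI with percent-encoded credentials."""
    return "mongodb://{}:{}@{}:{}/?authSource=admin".format(
        quote_plus(username), quote_plus(password), host, port
    )

# Placeholder credentials, for illustration only
uri = build_mongo_uri("bipm_admin", "p@ss/word")
print(uri)

# With a running, authenticated server one could then connect like this:
# from pymongo import MongoClient
# client = MongoClient(uri)
```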

In [1]:
# Install the pymongo Python Package 
# !pip3 install pymongo

In [2]:
from pymongo import MongoClient
from pprint import pprint
import requests 

In [3]:
# Client connects to "localhost" by default 
client = MongoClient()
# Create local "bipm" database on the fly 
db = client['bipm']

In [4]:
# When we rerun the whole notebook, we start from scratch 
# by dropping the collection "courses"
db.courses.drop()

In [5]:
# Create a list of Python dictionaries, one per course
courses = [
    {'title': 'Data Science',
     'lecturer': {
         'name': 'Markus Löcher',
         'department': 'Math',
         'status': 'Professor'
     },
     'semester': 1},
    {'title': 'Data Warehousing',
     'lecturer': {
         'name': 'Roland M. Mueller',
         'department': 'Information Systems',
         'status': 'Professor'
     },
     'semester': 1},
    {'title': 'Business Process Management',
     'lecturer': {
         'name': 'Frank Habermann',
         'department': 'Information Systems',
         'status': 'Professor'
     },
     'semester': 1},
    {'title': 'Stratigic Issues of IT',
     'lecturer': {
         'name': 'Sven Pohland',
         'department': 'Information Systems',
         'status': 'Professor'
     },
     'semester': 1},
    {'title': 'Text, Web and Social Media Analytics Lab',
     'lecturer': {
         'name': 'Markus Löcher',
         'department': 'Math',
         'status': 'Professor'
     },
     'semester': 2},
    {'title': 'Enterprise Architectures for Big Data',
     'lecturer': {
         'name': 'Roland M. Mueller',
         'department': 'Information Systems',
         'status': 'Professor'
     },
     'semester': 2},
    {'title': 'Business Process Integration Lab',
     'lecturer': {
         'name': 'Frank Habermann',
         'department': 'Information Systems',
         'status': 'Professor'
     },
     'semester': 2},
    {'title': 'IT-Security and Privacy',
     'lecturer': {
         'name': 'Dennis Uckel',
         'department': 'Information Systems',
         'status': 'External'
     },
     'semester': 2},
    {'title': 'Research Methods',
     'lecturer': {
         'name': 'Marcus Birkenkrahe',
         'department': 'Information Systems',
         'status': 'Professor'
     },
     'semester': 2},
]

In [6]:
pprint(courses)

[{'lecturer': {'department': 'Math',
               'name': 'Markus Löcher',
               'status': 'Professor'},
  'semester': 1,
  'title': 'Data Science'},
 {'lecturer': {'department': 'Information Systems',
               'name': 'Roland M. Mueller',
               'status': 'Professor'},
  'semester': 1,
  'title': 'Data Warehousing'},
 {'lecturer': {'department': 'Information Systems',
               'name': 'Frank Habermann',
               'status': 'Professor'},
  'semester': 1,
  'title': 'Business Process Management'},
 {'lecturer': {'department': 'Information Systems',
               'name': 'Sven Pohland',
               'status': 'Professor'},
  'semester': 1,
  'title': 'Stratigic Issues of IT'},
 {'lecturer': {'department': 'Math',
               'name': 'Markus Löcher',
               'status': 'Professor'},
  'semester': 2,
  'title': 'Text, Web and Social Media Analytics Lab'},
 {'lecturer': {'department': 'Information Systems',
               'name': 'Roland M. Mu

## insert_many() and insert_one()

You can use the insert_one() method and the insert_many() method to add documents to a collection in MongoDB. If you attempt to add documents to a collection that does not exist, MongoDB will create the collection for you.


In [7]:
db.courses.insert_many(courses)

<pymongo.results.InsertManyResult at 0x10ef3d8c8>

## find()

You can use the find() method to issue a query to retrieve data from a collection in MongoDB. All queries in MongoDB have the scope of a single collection.
Queries can return all documents in a collection or only the documents that match a specified filter. You specify the filter as a document and pass it as a parameter to the find() method.
The find() method returns query results in a cursor, which is an iterable object that yields documents.
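
A cursor is consumed as you iterate over it, much like a Python generator. The sketch below uses a plain generator as a stand-in (`fake_cursor` is a hypothetical helper, not a PyMongo cursor) to show this single-pass behaviour:

```python
def fake_cursor(documents):
    """Stand-in for a cursor: yields documents one at a time."""
    for document in documents:
        yield document

cursor = fake_cursor([{"title": "Data Science"}, {"title": "Data Warehousing"}])

first_pass = [doc["title"] for doc in cursor]
second_pass = [doc["title"] for doc in cursor]  # the iterator is already exhausted

print(first_pass)
print(second_pass)
```

If the results are needed more than once, wrapping the cursor in `list()` keeps them in memory.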

To return all documents in a collection, call the find() method without a criteria document. 

In [8]:
cursor = db.courses.find()
for document in cursor:
    pprint(document)

{'_id': ObjectId('5cbd6d4d6879c22b4da9e851'),
 'lecturer': {'department': 'Math',
              'name': 'Markus Löcher',
              'status': 'Professor'},
 'semester': 1,
 'title': 'Data Science'}
{'_id': ObjectId('5cbd6d4d6879c22b4da9e852'),
 'lecturer': {'department': 'Information Systems',
              'name': 'Roland M. Mueller',
              'status': 'Professor'},
 'semester': 1,
 'title': 'Data Warehousing'}
{'_id': ObjectId('5cbd6d4d6879c22b4da9e853'),
 'lecturer': {'department': 'Information Systems',
              'name': 'Frank Habermann',
              'status': 'Professor'},
 'semester': 1,
 'title': 'Business Process Management'}
{'_id': ObjectId('5cbd6d4d6879c22b4da9e854'),
 'lecturer': {'department': 'Information Systems',
              'name': 'Sven Pohland',
              'status': 'Professor'},
 'semester': 1,
 'title': 'Stratigic Issues of IT'}
{'_id': ObjectId('5cbd6d4d6879c22b4da9e855'),
 'lecturer': {'department': 'Math',
              'name': 'Markus Löche

In [9]:
another_course = {'title': 'Master Thesis', 'semester': 3}

In [10]:
db.courses.insert_one(another_course)

<pymongo.results.InsertOneResult at 0x10ef3db08>

In [11]:
cursor = db.courses.find()
for document in cursor:
    pprint(document)

{'_id': ObjectId('5cbd6d4d6879c22b4da9e851'),
 'lecturer': {'department': 'Math',
              'name': 'Markus Löcher',
              'status': 'Professor'},
 'semester': 1,
 'title': 'Data Science'}
{'_id': ObjectId('5cbd6d4d6879c22b4da9e852'),
 'lecturer': {'department': 'Information Systems',
              'name': 'Roland M. Mueller',
              'status': 'Professor'},
 'semester': 1,
 'title': 'Data Warehousing'}
{'_id': ObjectId('5cbd6d4d6879c22b4da9e853'),
 'lecturer': {'department': 'Information Systems',
              'name': 'Frank Habermann',
              'status': 'Professor'},
 'semester': 1,
 'title': 'Business Process Management'}
{'_id': ObjectId('5cbd6d4d6879c22b4da9e854'),
 'lecturer': {'department': 'Information Systems',
              'name': 'Sven Pohland',
              'status': 'Professor'},
 'semester': 1,
 'title': 'Stratigic Issues of IT'}
{'_id': ObjectId('5cbd6d4d6879c22b4da9e855'),
 'lecturer': {'department': 'Math',
              'name': 'Markus Löche

The query condition in `find()` and `find_one()` for an equality match on a field has the following form:
`{ <field1>: <value1>, <field2>: <value2>, ... }`

* `find()` returns all matching documents.
* `find_one()` returns the first matching document, or `None` if nothing matches.


In [12]:
result =  db.courses.find_one({"title": "Data Science"})
pprint(result)

{'_id': ObjectId('5cbd6d4d6879c22b4da9e851'),
 'lecturer': {'department': 'Math',
              'name': 'Markus Löcher',
              'status': 'Professor'},
 'semester': 1,
 'title': 'Data Science'}


The return value of `find_one()` is a plain Python dictionary.

In [13]:
type(result)

dict

In [14]:
print(result["_id"])
print(result["lecturer"]["name"])

5cbd6d4d6879c22b4da9e851
Markus Löcher


In [15]:
result =  db.courses.find_one({"semester": 2})
pprint(result)

{'_id': ObjectId('5cbd6d4d6879c22b4da9e855'),
 'lecturer': {'department': 'Math',
              'name': 'Markus Löcher',
              'status': 'Professor'},
 'semester': 2,
 'title': 'Text, Web and Social Media Analytics Lab'}


`find()` returns a cursor that you can loop over.

In [16]:
cursor = db.courses.find({"semester": 2})
for document in cursor:
    print(document['title'])

Text, Web and Social Media Analytics Lab
Enterprise Architectures for Big Data
Business Process Integration Lab
IT-Security and Privacy
Research Methods


Print the names of all lecturers in the second semester:

In [31]:
# Filter on the semester, then read the name from the embedded lecturer document
cursor = db.courses.find({"semester": 2})
for document in cursor:
    print(document['lecturer']['name'])

Print the titles of all courses that Frank Habermann teaches:

In [32]:
# Dot notation reaches into the embedded lecturer document
cursor = db.courses.find({"lecturer.name": "Frank Habermann"})
for document in cursor:
    print(document['title'], "in Semester", document['semester'])
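
The `"lecturer.name"` key above uses dot notation to reach into an embedded document. As a rough mental model only (this is not how the server implements it, and arrays are ignored), resolving such a path against a Python dictionary could look like this. `resolve_path` is a hypothetical helper, not part of PyMongo:

```python
def resolve_path(document, path):
    """Follow a dotted path such as 'lecturer.name' through nested dicts.

    Returns None if any part of the path is missing.
    """
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

course = {
    "title": "Business Process Management",
    "lecturer": {"name": "Frank Habermann", "department": "Information Systems"},
    "semester": 1,
}

print(resolve_path(course, "lecturer.name"))
print(resolve_path(course, "lecturer.office"))
```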

You can specify a logical conjunction (AND) of several query conditions by separating them with commas in the filter document. For example:
```python
cursor = db.restaurants.find({"cuisine": "Italian", "address.zipcode": "10075"})
```

Print all lectures that Frank Habermann teaches in the second semester:

In [33]:
# Both conditions must hold (implicit AND)
cursor = db.courses.find({"lecturer.name": "Frank Habermann", 
                          "semester": 2})
for document in cursor:
    print(document['title'], "in Semester", document['semester'])

You can specify a logical disjunction (OR) of several query conditions by using the `$or` query operator. For example:
```python
cursor = db.restaurants.find({"$or": [{"cuisine": "Italian"}, {"address.zipcode": "10075"}]})
```

Print all lectures from either Frank Habermann or Markus Löcher:

In [36]:
# At least one of the listed conditions must hold ($or)
cursor = db.courses.find({"$or": 
                          [{"lecturer.name": "Frank Habermann"}, 
                           {"lecturer.name": "Markus Löcher"}]})
for document in cursor:
    print(document['title'], "in Semester", document['semester'])
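
To make the AND/OR semantics above concrete, here is a small pure-Python sketch that evaluates such a filter against dictionaries in memory. It is a simplified mental model, not MongoDB's implementation: `matches` is a hypothetical helper, and it handles only equality conditions on top-level fields plus `$or` (no dot notation, no arrays, no other operators).

```python
def matches(document, query):
    """Return True if the document satisfies every condition in the query."""
    for field, expected in query.items():
        if field == "$or":
            # At least one of the sub-queries has to match
            if not any(matches(document, sub_query) for sub_query in expected):
                return False
        elif document.get(field) != expected:
            return False
    return True  # all conditions held (implicit AND)

# Shortened sample data, for illustration only
courses = [
    {"title": "Data Science", "semester": 1},
    {"title": "Research Methods", "semester": 2},
    {"title": "IT-Security and Privacy", "semester": 2},
]

# semester == 2 AND (title == "Research Methods" OR title == "Data Science")
query = {
    "semester": 2,
    "$or": [{"title": "Research Methods"}, {"title": "Data Science"}],
}

titles = [course["title"] for course in courses if matches(course, query)]
print(titles)
```

"Data Science" satisfies the `$or` part but not the semester condition, so only one course remains.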

You can also query with the greater-than operator (`$gt`) or the less-than operator (`$lt`).

In [21]:
cursor = db.courses.find({"semester": {"$gt": 1}})
for document in cursor:
    print("Semester", document['semester'], ":", document['title'])

Semester 2 : Text, Web and Social Media Analytics Lab
Semester 2 : Enterprise Architectures for Big Data
Semester 2 : Business Process Integration Lab
Semester 2 : IT-Security and Privacy
Semester 2 : Research Methods
Semester 3 : Master Thesis
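
One caveat when writing range filters in Python: a dictionary cannot hold the same key twice, so two conditions on the same field silently collapse into the last one. Putting both operators into one sub-document, or using the `$and` operator, avoids this. The sketch only builds the filter dictionaries and needs no database:

```python
# Two conditions on the same key: Python keeps only the last one.
collapsed = {"semester": {"$gt": 1}, "semester": {"$lt": 3}}
print(collapsed)

# Option 1: put both operators into one sub-document for the field.
combined = {"semester": {"$gt": 1, "$lt": 3}}

# Option 2: spell out the conjunction with $and.
explicit = {"$and": [{"semester": {"$gt": 1}}, {"semester": {"$lt": 3}}]}

# Either filter could then be passed to find(), e.g. db.courses.find(combined)
```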


In [22]:
db.courses.count_documents({"semester": 2})

5

# Downloading Nobel Prize Winners with an API and storing them in MongoDB

![](https://upload.wikimedia.org/wikipedia/en/e/ed/Nobel_Prize.png)
The Nobel Prize offers a web API, documented at https://nobelprize.readme.io/docs/prize

Because the API returns JSON and MongoDB stores documents in a JSON-like format, a document database like MongoDB is a good fit for storing the API results. You can get all laureates at http://api.nobelprize.org/v1/laureate.json and all prizes at http://api.nobelprize.org/v1/prize.json

We will just download all laureates and prizes and store them in MongoDB!

In [23]:
# Create local "nobel" database on the fly 
db = client["nobel"]
db.prizes.drop()
db.laureates.drop()
# API documented at https://nobelprize.readme.io/docs/prize 
for collection_name in ["prizes", "laureates"]:
    singular = collection_name[:-1]  # the API uses the singular form in the URL
    response = requests.get("http://api.nobelprize.org/v1/{}.json".format(singular))
    response.raise_for_status()  # stop early if the download failed
    # The payload is keyed by the plural collection name
    documents = response.json()[collection_name]
    # Create collections on the fly
    db[collection_name].insert_many(documents)
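
The loop above relies on the API wrapping the records in an object keyed by the plural collection name, which is why `response.json()[collection_name]` yields a list that can go straight into `insert_many()`. Below is a small offline sketch of that step. It uses a hand-written, heavily shortened payload (illustrative data, not a real API response), and `extract_documents` is a hypothetical helper that also checks for the expected key:

```python
import json

# Shortened, hand-written stand-in for the body of an API response
raw_body = '{"laureates": [{"id": "1", "firstname": "Wilhelm Conrad", "surname": "Röntgen"}]}'

def extract_documents(body, collection_name):
    """Parse a JSON body and return the list stored under collection_name."""
    payload = json.loads(body)
    if collection_name not in payload:
        raise KeyError("expected key {!r} in API response".format(collection_name))
    return payload[collection_name]

documents = extract_documents(raw_body, "laureates")
print(len(documents), documents[0]["surname"])
```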

In [24]:
result = db.laureates.find_one({})
pprint(result)
result["firstname"]

{'_id': ObjectId('5cbd6d4f6879c22b4da9eaa9'),
 'born': '1845-03-27',
 'bornCity': 'Lennep (now Remscheid)',
 'bornCountry': 'Prussia (now Germany)',
 'bornCountryCode': 'DE',
 'died': '1923-02-10',
 'diedCity': 'Munich',
 'diedCountry': 'Germany',
 'diedCountryCode': 'DE',
 'firstname': 'Wilhelm Conrad',
 'gender': 'male',
 'id': '1',
 'prizes': [{'affiliations': [{'city': 'Munich',
                               'country': 'Germany',
                               'name': 'Munich University'}],
             'category': 'physics',
             'motivation': '"in recognition of the extraordinary services he '
                           'has rendered by the discovery of the remarkable '
                           'rays subsequently named after him"',
             'share': '1',
             'year': '1901'}],
 'surname': 'Röntgen'}


'Wilhelm Conrad'

In [25]:
db.laureates.count_documents({"gender": "female"})

51

In [26]:
db.laureates.distinct("bornCountry", {"bornCountry": {"$regex": "Germany"}})

['Prussia (now Germany)',
 'Hesse-Kassel (now Germany)',
 'Germany',
 'Schleswig (now Germany)',
 'Germany (now Poland)',
 'Germany (now France)',
 'West Germany (now Germany)',
 'Bavaria (now Germany)',
 'Germany (now Russia)',
 'Mecklenburg (now Germany)',
 'W&uuml;rttemberg (now Germany)',
 'East Friesland (now Germany)']

In [27]:
db.laureates.count_documents({"bornCountry": {"$regex": "Germany"}})

95
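
Note that `$regex` with the pattern `Germany` is an unanchored match, so the count also includes birthplaces such as `Germany (now Poland)`. The sketch below applies Python's `re` module to the distinct values listed above to show the difference between an unanchored and an anchored pattern. The stricter pattern is only one possible choice, and MongoDB's regex engine is not identical to Python's, although simple patterns like these should behave the same way.

```python
import re

# Distinct values copied from the output of distinct() above
born_countries = [
    "Prussia (now Germany)",
    "Hesse-Kassel (now Germany)",
    "Germany",
    "Schleswig (now Germany)",
    "Germany (now Poland)",
    "Germany (now France)",
    "West Germany (now Germany)",
    "Bavaria (now Germany)",
    "Germany (now Russia)",
    "Mecklenburg (now Germany)",
    "W&uuml;rttemberg (now Germany)",
    "East Friesland (now Germany)",
]

# Unanchored: "Germany" may appear anywhere in the string
unanchored = [c for c in born_countries if re.search(r"Germany", c)]

# Anchored: exactly "Germany", or a place that is "(now Germany)"
now_in_germany = [
    c for c in born_countries if re.search(r"^Germany$|\(now Germany\)$", c)
]

print(len(unanchored), len(now_in_germany))
```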

In [28]:
cursor = db.laureates.find({"bornCountry": {"$regex": "Germany"}, "prizes.category": "physics"})
for document in cursor:
    print(document["prizes"][0]["year"], document["firstname"], document["surname"])

1901 Wilhelm Conrad Röntgen
1909 Karl Ferdinand Braun
1914 Max von Laue
1918 Max Karl Ernst Ludwig Planck
1919 Johannes Stark
1921 Albert Einstein
1925 James Franck
1925 Gustav Ludwig Hertz
1932 Werner Karl Heisenberg
1943 Otto Stern
1954 Max Born
1954 Walther Bothe
1955 Polykarp Kusch
1961 Rudolf Ludwig Mössbauer
1963 Maria Goeppert Mayer
1963 J. Hans D. Jensen
1966 Alfred Kastler
1967 Hans Albrecht Bethe
1978 Arno Allan Penzias
1986 Ernst Ruska
1986 Gerd Binnig
1987 J. Georg Bednorz
1988 Jack Steinberger
1989 Hans G. Dehmelt
1989 Wolfgang Paul
1998 Horst L. Störmer
2000 Herbert Kroemer
2001 Wolfgang Ketterle
2005 Theodor W. Hänsch
2017 Rainer Weiss
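
One subtlety in the loop above: the filter `"prizes.category": "physics"` matches a laureate if any entry in the `prizes` array is a physics prize, while `document["prizes"][0]` always prints the first entry. For laureates with more than one prize these can differ. The sketch below picks the matching entry instead. It uses a hand-written, shortened document (Marie Curie received the physics prize in 1903 and the chemistry prize in 1911), and `prize_in_category` is a hypothetical helper:

```python
def prize_in_category(document, category):
    """Return the first prize of the given category, or None."""
    for prize in document.get("prizes", []):
        if prize.get("category") == category:
            return prize
    return None

# Hand-written, shortened document for illustration
laureate = {
    "firstname": "Marie",
    "surname": "Curie, née Sklodowska",
    "prizes": [
        {"category": "physics", "year": "1903"},
        {"category": "chemistry", "year": "1911"},
    ],
}

chemistry = prize_in_category(laureate, "chemistry")
print(chemistry["year"], "vs. first entry:", laureate["prizes"][0]["year"])
```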


In [29]:
pprint(db.laureates.find_one({"firstname": "Malala"}))

{'_id': ObjectId('5cbd6d4f6879c22b4da9ee1f'),
 'born': '1997-07-12',
 'bornCity': 'Mingora',
 'bornCountry': 'Pakistan',
 'bornCountryCode': 'PK',
 'died': '0000-00-00',
 'firstname': 'Malala',
 'gender': 'female',
 'id': '914',
 'prizes': [{'affiliations': [[]],
             'category': 'peace',
             'motivation': '"for their struggle against the suppression of '
                           'children and young people and for the right of all '
                           'children to education"',
             'share': '2',
             'year': '2014'}],
 'surname': 'Yousafzai'}


In [30]:
cursor = db.laureates.find({"gender": "female"}).sort([("prizes.year", 1)])
for document in cursor:
    year = document["prizes"][0]["year"]
    firstname = document["firstname"]
    surname = document.get("surname", "")
    print(year, firstname, surname)

1903 Marie Curie, née Sklodowska
1905 Baroness Bertha Sophie Felicita von Suttner, née Countess Kinsky von Chinic und Tettau
1909 Selma Ottilia Lovisa Lagerlöf
1926 Grazia Deledda
1928 Sigrid Undset
1931 Jane Addams
1935 Irène Joliot-Curie
1938 Pearl Buck
1945 Gabriela Mistral
1946 Emily Greene Balch
1947 Gerty Theresa Cori, née Radnitz
1963 Maria Goeppert Mayer
1964 Dorothy Crowfoot Hodgkin
1966 Nelly Sachs
1976 Betty Williams
1976 Mairead Corrigan
1977 Rosalyn Yalow
1979 Mother Teresa 
1982 Alva Myrdal
1983 Barbara McClintock
1986 Rita Levi-Montalcini
1988 Gertrude B. Elion
1991 Aung San Suu Kyi 
1991 Nadine Gordimer
1992 Rigoberta Menchú Tum
1993 Toni Morrison
1995 Christiane Nüsslein-Volhard
1996 Wislawa Szymborska
1997 Jody Williams
2003 Shirin Ebadi
2004 Linda B. Buck
2004 Elfriede Jelinek
2004 Wangari Muta Maathai
2007 Doris Lessing
2008 Françoise Barré-Sinoussi
2009 Elizabeth H. Blackburn
2009 Carol W. Greider
2009 Ada E. Yonath
2009 Herta Müller
2009 Elinor Ostrom
2011 Ellen J