# Upsert DataFrame to MongoDB Step by Step
> MongoDB is a rich, document-oriented NoSQL database. After spending some time digging into it, I share what I learned here in the form of an exercise.

- toc: true
- branch: master
- badges: true
- categories: [NoSQL, Python]
<!-- - image: images/similar-words.png -->


# Requirement

I am building a personal application in Python. The requirement is to create or update documents in a MongoDB collection (the counterpart of a table in a relational database) from the rows of a DataFrame. More specifically, rows whose key values do not yet exist in the collection are inserted, and rows whose key values already exist are updated.
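
The create-or-update rule can be sketched without a database. The snippet below is only an illustration under simplifying assumptions: a plain dict stands in for the collection, and `upsert_rows` with its parameters is a hypothetical name made up for this post, not part of pymongo.

```python
def upsert_rows(store, rows, keys):
    """Insert or update each row in `store`, a dict acting as a stand-in collection.

    Rows are matched on the values of `keys`; returns (inserted, updated) counts.
    """
    inserted = updated = 0
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key in store:
            store[key].update(row)   # existing row: overwrite the supplied fields
            updated += 1
        else:
            store[key] = dict(row)   # new row: create it
            inserted += 1
    return inserted, updated


store = {}
first = upsert_rows(store, [{"name": "user1", "age": 25}], ["name"])
second = upsert_rows(
    store,
    [{"name": "user1", "age": 32}, {"name": "user2", "age": 55}],
    ["name"],
)
print(first, second)  # (1, 0) (1, 1)
```

The second call updates user1 and inserts user2, which is the behavior the rest of this post reproduces with MongoDB.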

In [5]:
#hide
from pymongo import MongoClient, UpdateOne
from pandas import DataFrame
from datetime import datetime

CONN_STR = 'mongodb+srv://fin-market.x5tm4.mongodb.net/myFirstDatabase?authSource=%24external&authMechanism=MONGODB-X509&tls=true&tlsCertificateKeyFile=../../../drivers/X509-cert-fin-market.pem'

In [6]:
def show_collection(collection):
    # Load every document in the collection into a DataFrame and render it
    display(DataFrame(list(collection.find())))

# Set Up a MongoDB Atlas Database

Get a free cloud based MongoDB database [here](https://www.mongodb.com/cloud/atlas/lp/try2?utm_source=google&utm_campaign=gs_apac_hong_kong_search_core_brand_atlas_desktop&utm_term=mongodb%20atlas&utm_medium=cpc_paid_search&utm_ad=e&utm_ad_campaign_id=12212624344&adgroup=115749709143&gclid=Cj0KCQiAqbyNBhC2ARIsALDwAsApK8irJVZnesxewSitv8kTagWactqhvZPG4gpz5CKx0JV3Fh1pEIsaAttLEALw_wcB).

Connect to the cluster, drop any existing test database so the exercise starts clean, and get a handle to the employee collection.

In [7]:
DB_NAME = 'test'
COLLECTION_NAME = 'employee'

In [8]:
client = MongoClient(CONN_STR)
client.drop_database(DB_NAME)
db = client[DB_NAME] # switch database
collection = db[COLLECTION_NAME] # get the collection

In [9]:
#hide
now = datetime.now()
emplyee = [['user1',25,'male', now],['user2',55,'male', now],['user3',43,'male', now]]
df_emplyee = DataFrame(emplyee, columns=['name','age','sex', 'lastModifiedAt'])
df_emplyee = df_emplyee[['name','age','sex']]

Prepare a test DataFrame, df_emplyee, that contains three columns:
- name
- age
- sex

In [10]:
df_emplyee

Unnamed: 0,name,age,sex
0,user1,25,male
1,user2,55,male
2,user3,43,male


In [11]:
collection.insert_many(df_emplyee.to_dict("records"))
show_collection(collection)

Unnamed: 0,_id,name,age,sex
0,61b217d096bc99f7804c7a01,user1,25,male
1,61b217d096bc99f7804c7a02,user2,55,male
2,61b217d096bc99f7804c7a03,user3,43,male


## Update & upsert

Change the age of user1 to 32 and record the time of the change in lastModifiedAt.

In [12]:
myquery = { "name": "user1" }
newvalues = { "$set": { "age": 32 }, "$currentDate": {"lastModifiedAt": { "$type": "date" }} }  # age stays an integer, matching the other rows
collection.update_one(myquery, newvalues)

<pymongo.results.UpdateResult at 0x7f4b73fe9340>

In [13]:
show_collection(collection)

Unnamed: 0,_id,name,age,sex,lastModifiedAt
0,61b217d096bc99f7804c7a01,user1,32,male,2021-12-09 14:50:57.251
1,61b217d096bc99f7804c7a02,user2,55,male,NaT
2,61b217d096bc99f7804c7a03,user3,43,male,NaT

Run the same update again with upsert=True. Because user1 already exists, the document is updated as before and nothing is inserted; only lastModifiedAt changes.


In [14]:
collection.update_one({"name":"user1"}, 
                      {"$set":{"age":32},
                              "$currentDate": 
                              {"lastModifiedAt": { "$type": "date" }}
                      }, 
                      upsert=True)
show_collection(collection)

Unnamed: 0,_id,name,age,sex,lastModifiedAt
0,61b217d096bc99f7804c7a01,user1,32,male,2021-12-09 14:50:57.655
1,61b217d096bc99f7804c7a02,user2,55,male,NaT
2,61b217d096bc99f7804c7a03,user3,43,male,NaT

Now target user4, which does not exist yet. The upsert inserts a new document, and the fields under `$setOnInsert` are written only because this is an insert.


In [15]:
collection.update_one({"name":"user4"}, 
                      {"$set":{"age":32}, 
                              "$setOnInsert":{"sex":"female"},
                              "$currentDate":{"lastModifiedAt": { "$type": "date" }}
                      }, 
                      upsert=True)
show_collection(collection)

Unnamed: 0,_id,name,age,sex,lastModifiedAt
0,61b217d096bc99f7804c7a01,user1,32,male,2021-12-09 14:50:57.655
1,61b217d096bc99f7804c7a02,user2,55,male,NaT
2,61b217d096bc99f7804c7a03,user3,43,male,NaT
3,61b217d212c824e29e2d3fbc,user4,32,female,2021-12-09 14:50:58.072
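
The last two cells differ only in whether the filter matches a document. A rough pure-Python model of how the filter, `$set`, and `$setOnInsert` combine during an upsert may make this easier to follow. `simulate_upsert` is a hypothetical helper written for this post: it handles only simple equality filters and ignores `$currentDate` and every other operator.

```python
def simulate_upsert(docs, query, update):
    """Rough model of update_one(query, update, upsert=True) on a list of dicts."""
    for doc in docs:
        if all(doc.get(k) == v for k, v in query.items()):
            # Match found: only $set is applied, $setOnInsert is ignored
            doc.update(update.get("$set", {}))
            return "updated"
    # No match: the new document starts from the filter's equality fields,
    # then receives both the $set and the $setOnInsert fields
    new_doc = dict(query)
    new_doc.update(update.get("$set", {}))
    new_doc.update(update.get("$setOnInsert", {}))
    docs.append(new_doc)
    return "inserted"


docs = [{"name": "user1", "age": 25, "sex": "male"}]
update = {"$set": {"age": 32}, "$setOnInsert": {"sex": "female"}}
r1 = simulate_upsert(docs, {"name": "user1"}, update)  # user1 keeps sex "male"
r2 = simulate_upsert(docs, {"name": "user4"}, update)  # user4 is created with sex "female"
```

This mirrors the output above: user1 keeps its original sex value, while the newly created user4 picks up the `$setOnInsert` value.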


# Bulk update

To upsert many rows at once, build one UpdateOne per DataFrame row and send them all in a single bulk_write call. The DataFrame still holds age 25 for user1, so this run sets user1 back to 25; user4 is not in the DataFrame and is left untouched.

In [16]:
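from pandas import concat

# DataFrame.append was removed in pandas 2.0; concat is the supported way to add a row
df_emplyee = concat([df_emplyee, DataFrame([{'name':'user5','age': 65, 'sex':'male'}])], ignore_index=True)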


updates = []
df_emplyee.apply(
        lambda row: updates.append(
            UpdateOne(
                {"name": row.get("name")}, 
                {"$set": row.to_dict(), 
                         "$currentDate":{"lastModifiedAt": { "$type": "date" }}
                }, 
                upsert=True
            )),
        axis=1)
collection.bulk_write(updates)
show_collection(collection)

Unnamed: 0,_id,name,age,sex,lastModifiedAt
0,61b217d096bc99f7804c7a01,user1,25,male,2021-12-09 14:50:58.445
1,61b217d096bc99f7804c7a02,user2,55,male,2021-12-09 14:50:58.445
2,61b217d096bc99f7804c7a03,user3,43,male,2021-12-09 14:50:58.445
3,61b217d212c824e29e2d3fbc,user4,32,female,2021-12-09 14:50:58.072
4,61b217d212c824e29e2d3fcb,user5,65,male,2021-12-09 14:50:58.445
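
Calling `DataFrame.apply` only for its side effect of appending to a list works, but a list comprehension over `df_emplyee.to_dict("records")` is arguably easier to read. The sketch below builds the filter/update pairs as plain dicts so that it runs without a database connection. `build_update_specs` is a hypothetical helper, and each pair would still need to be wrapped as `UpdateOne(filter, update, upsert=True)` before being passed to `bulk_write`.

```python
def build_update_specs(records, key="name"):
    """Return one (filter, update) pair per record, matching documents on `key`."""
    return [
        (
            {key: rec[key]},
            {
                "$set": dict(rec),
                "$currentDate": {"lastModifiedAt": {"$type": "date"}},
            },
        )
        for rec in records
    ]


# Same shape as df_emplyee.to_dict("records")
records = [
    {"name": "user1", "age": 25, "sex": "male"},
    {"name": "user5", "age": 65, "sex": "male"},
]
specs = build_update_specs(records)
```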


# DataFrame upsert

Wrap the bulk upsert in a reusable function that takes the DataFrame, the collection, and the list of key columns used to match existing documents.

In [17]:
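def df_upsert(df: DataFrame, collection, keys: list):
    """Upsert every row of df into collection, matching documents on the key columns."""
    def row_query(row, keys):
        # Build the filter from the key columns of this row
        return {key: row.get(key) for key in keys}

    updates = []
    # One upsert per row of the DataFrame passed in as df
    df.apply(
        lambda row: updates.append(
            UpdateOne(
                row_query(row, keys),
                {'$set': row.to_dict(),
                 "$currentDate": {"lastModifiedAt": {"$type": "date"}}
                },
                upsert=True)),
        axis=1
    )
    if updates:  # bulk_write rejects an empty list of operations
        collection.bulk_write(updates)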
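
One caveat with key-based matching: if a key column is absent from the DataFrame, `row.get(key)` returns `None`, and a filter such as `{"name": None}` matches documents whose `name` is null or missing. Duplicate key values in the DataFrame also mean the same document is written more than once in a single batch. A small pre-check can catch both cases before anything is sent; `check_keys` is a hypothetical helper sketched here on plain records, not something pymongo provides.

```python
def check_keys(records, keys):
    """Raise ValueError if a key field is missing/None or a key combination repeats."""
    seen = set()
    for i, rec in enumerate(records):
        missing = [k for k in keys if rec.get(k) is None]
        if missing:
            raise ValueError(f"record {i} has no value for key field(s) {missing}")
        combo = tuple(rec[k] for k in keys)
        if combo in seen:
            raise ValueError(f"record {i} repeats key {combo}")
        seen.add(combo)


check_keys([{"name": "user1"}, {"name": "user2"}], ["name"])  # passes silently

try:
    check_keys([{"name": "user1"}, {"name": "user1"}], ["name"])
except ValueError as exc:
    duplicate_error = str(exc)
```

It could be called on `df.to_dict("records")` at the top of `df_upsert`, though whether duplicates should be an error or simply "last one wins" depends on the application.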


In [18]:
show_collection(collection)

Unnamed: 0,_id,name,age,sex,lastModifiedAt
0,61b217d096bc99f7804c7a01,user1,25,male,2021-12-09 14:50:58.445
1,61b217d096bc99f7804c7a02,user2,55,male,2021-12-09 14:50:58.445
2,61b217d096bc99f7804c7a03,user3,43,male,2021-12-09 14:50:58.445
3,61b217d212c824e29e2d3fbc,user4,32,female,2021-12-09 14:50:58.072
4,61b217d212c824e29e2d3fcb,user5,65,male,2021-12-09 14:50:58.445


In [19]:
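from pandas import concat

# DataFrame.append was removed in pandas 2.0; concat is the supported way to add a row
df_emplyee = concat([df_emplyee, DataFrame([{'name':'user6','age': 37, 'sex':'female'}])], ignore_index=True)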
df_emplyee

Unnamed: 0,name,age,sex
0,user1,25,male
1,user2,55,male
2,user3,43,male
3,user5,65,male
4,user6,37,female


In [20]:
df_upsert(df_emplyee, collection, ['name'])
show_collection(collection)

Unnamed: 0,_id,name,age,sex,lastModifiedAt
0,61b217d096bc99f7804c7a01,user1,25,male,2021-12-09 14:51:03.115
1,61b217d096bc99f7804c7a02,user2,55,male,2021-12-09 14:51:03.115
2,61b217d096bc99f7804c7a03,user3,43,male,2021-12-09 14:51:03.115
3,61b217d212c824e29e2d3fbc,user4,32,female,2021-12-09 14:50:58.072
4,61b217d212c824e29e2d3fcb,user5,65,male,2021-12-09 14:51:03.116
5,61b217d712c824e29e2d4061,user6,37,female,2021-12-09 14:51:03.116
