# Joint Undergraduate Researcher Onboarding #5
**Topic:** Introduction to maggma to ease working with MongoDB databases

**Date:** April 4, 2022

**Prepared by:** Ann Rutt

# Outline & Relevant Documentation

Onboarding Session Demonstration:
* Getting Stores: Connecting to your Database
* Store `key` Attribute: Specify field for distinguishing documents
* Store `query_one()` and `query()` Methods: Accessing database docs
* Store `update()` Method: Storing database docs
* Store `count()` Method
* Store `distinct()` Method: Get distinct values of specified field
* Store `remove_docs()` Method
* Getting a Builder
* Builder `get_items()` Method
* Builder `process_item()` Method
* Builder `update_targets()` Method
* Running a Builder for Testing: `run()` Method
* Writing a Custom Builder

Onboarding Independent Exercises Goals:
* Become familiar with the ApproxNEB data stored in `fw_acr_mv/approx_neb`
* Apply MongoDB querying and learn about more advanced querying techniques (https://docs.mongodb.com/manual/reference/operator/query/)
* Practice using common store methods (`query`, `count`, `distinct`)
* Try out developing and testing a builder by creating a custom MapBuilder

For further reading and general reference...
* Maggma Documentation: https://materialsproject.github.io/maggma/
* Maggma GitHub Repo: https://github.com/materialsproject/maggma

# Set-up Check: Mongogrant Credentials

Before the onboarding session, run the code in this section on your own to check that your `.mongogrant.json` file, which should be located in your home directory, is working correctly.

**You need to be connected to the LBL VPN to successfully run this code.**

In [3]:
from maggma.stores.advanced_stores import MongograntStore
aneb_store = MongograntStore("ro:mongodb07-ext.nersc.gov/fw_acr_mv","approx_neb")
aneb_store.connect()
aneb_store.query_one({})  # displays one document from the collection

The expected output is a long doc print-out. Reach out for help if you run into any errors or issues here so we can troubleshoot individually.

# Onboarding Session Demonstration

If maggma has not been installed yet, run this command in a terminal:

`pip install maggma`

In [4]:
# maggma is updated frequently, so it is important to check which version you are using
import maggma
print(maggma.__version__)
print(maggma.__path__)

0.50.3
['/Users/tashalewis/miniconda3/envs/torl/lib/python3.9/site-packages/maggma']


In [5]:
# pprint (pretty print) can be helpful for making dictionaries more readable when printed 
from pprint import pprint

We will be using database credentials stored in `.mongogrant.json` to give you read-only access to Ann's database hosted on NERSC so you can practice working with relevant data.


**Place a copy of the provided .mongogrant.json file in your home directory**

## Store Basics

### Getting Stores: Connecting to your Database

A `MongoStore` object connects to a MongoDB database; with the default host and port settings, this is a database running on your local machine

In [6]:
from maggma.stores.mongolike import MongoStore

In [7]:
sandbox_store = MongoStore(database= "local_dev",collection_name= "sandbox")
sandbox_store.connect()

A `MongograntStore` object connects to MongoDB databases (typically hosted on NERSC) with credentials stored in your `.mongogrant.json` file

In [8]:
from maggma.stores.advanced_stores import MongograntStore

In [9]:
aneb_store = MongograntStore("ro:mongodb07-ext.nersc.gov/fw_acr_mv","approx_neb") # note ro: specifies read-only access vs. rw: for read-write access
aneb_store.connect()

Advanced Tip: A `ConcatStore` can be used to combine multiple databases into a single store object

In [10]:
from maggma.stores.compound_stores import ConcatStore

In [11]:
combined_store = ConcatStore([sandbox_store, aneb_store])  # the list can hold any number of stores

### Store `key` Attribute: Specify field for distinguishing documents

In [9]:
print(sandbox_store.key)

task_id


In [10]:
print(aneb_store.key)

task_id


In [19]:
sandbox_store.key = "batt_id"
aneb_store.key = "batt_id"

### Store `query_one()` and `query()` Methods: Accessing database docs

In [12]:
doc = aneb_store.query_one({})
print(doc.keys())
print(doc["wf_uuid"])

dict_keys(['_id', 'last_updated', 'host', 'wf_uuid', 'end_points', 'tags', 'acr_notes', 'batt_id', 'cep', 'pathfinder', 'images'])
f720ba2f-be68-407f-a447-fc2ed7a20948


In [13]:
docs = aneb_store.query(criteria={"batt_id":"spinelTi2S4_Mg"})
for doc in docs:
    print(doc["batt_id"],doc["acr_notes"])

spinelTi2S4_Mg ['approx_neb_wf', 'all_images', '20191122_aneb_wf']
spinelTi2S4_Mg ['approx_neb_wf', 'all_images', '20210507_aneb_wf']


In [14]:
len(docs)

TypeError: object of type 'generator' has no len()

In [15]:
docs = list(aneb_store.query(criteria={"batt_id":"spinelTi2S4_Mg"}))
print(type(docs))

<class 'list'>


In [16]:
print(len(docs))

2
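
The `TypeError` above arises because `query()` returns a generator, which yields documents lazily and can be consumed only once. The following plain-Python sketch illustrates that behavior; `fake_query` and the sample documents are hypothetical stand-ins and not maggma code.

```python
def fake_query(docs, criteria):
    """Hypothetical stand-in for a store query: lazily yield matching docs."""
    for doc in docs:
        if all(doc.get(field) == value for field, value in criteria.items()):
            yield doc


# Made-up sample documents (the wf_uuid values are placeholders)
sample_docs = [
    {"batt_id": "spinelTi2S4_Mg", "wf_uuid": "uuid-A"},
    {"batt_id": "spinelTi2S4_Mg", "wf_uuid": "uuid-B"},
    {"batt_id": "0_Mg", "wf_uuid": None},
]

results = fake_query(sample_docs, {"batt_id": "spinelTi2S4_Mg"})
first_pass = list(results)   # consumes the generator
second_pass = list(results)  # nothing is left to yield

print(len(first_pass), len(second_pass))
```

Converting the result to a list once, as in the cell above, keeps the documents available for repeated use.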


### Store `update()` Method: Storing database docs

Note: Updating stores requires read-write access

In [17]:
# check that your sandbox store is empty
sandbox_store.query_one({})

In [20]:
sandbox_store.update({"batt_id":"0_Mg","wf_uuid":None})

In [21]:
# sandbox store is no longer empty
sandbox_store.query_one({})

{'_id': ObjectId('62463e1e59ab35c8b4ff2efd'),
 'batt_id': '0_Mg',
 'wf_uuid': None}

In [22]:
# let's try storing the 2 aneb docs from earlier
# must delete the unique doc identifier "_id" field first
for d in docs:
    del d["_id"]

In [23]:
sandbox_store.update(docs)

In [24]:
# let's check that all docs are stored - expecting 1+2 docs
print(len(list(sandbox_store.query({}))))

2


In [25]:
# only one of the 2 aneb docs ended up in the store
# both docs share the same batt_id, so with batt_id as the key field the second doc replaced the first
print(sandbox_store.key)

batt_id


In [26]:
# let's change to a key field with unique values and try again
sandbox_store.key = "wf_uuid"
sandbox_store.update(docs)

In [27]:
# let's check that all docs are stored - should get 3 docs now
print(len(list(sandbox_store.query({}))))

3
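
The counts above follow from how `update()` treats the key field: an incoming document replaces a stored document with the same key value instead of being added as a second copy. Below is a minimal plain-Python model of that upsert logic, assuming a list plays the role of the collection. The `upsert` function and the `uuid-A`/`uuid-B` values are hypothetical and only serve the illustration.

```python
def upsert(collection, docs, key):
    """Replace the stored doc with the same key value, or append if none exists."""
    for new_doc in docs:
        for i, old_doc in enumerate(collection):
            if old_doc.get(key) == new_doc[key]:
                collection[i] = dict(new_doc)
                break
        else:  # no stored doc shares this key value
            collection.append(dict(new_doc))


collection = []
upsert(collection, [{"batt_id": "0_Mg", "wf_uuid": None}], key="batt_id")

aneb_docs = [
    {"batt_id": "spinelTi2S4_Mg", "wf_uuid": "uuid-A"},
    {"batt_id": "spinelTi2S4_Mg", "wf_uuid": "uuid-B"},
]

# With batt_id as key, the second doc overwrites the first
upsert(collection, aneb_docs, key="batt_id")
count_with_batt_id = len(collection)

# With wf_uuid as key, each doc is treated as distinct
upsert(collection, aneb_docs, key="wf_uuid")
count_with_wf_uuid = len(collection)

print(count_with_batt_id, count_with_wf_uuid)
```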


### Store `count()` Method

In [28]:
sandbox_store.count()

3

In [29]:
sandbox_store.count({"batt_id":"0_Mg"})

1

### Store `distinct()` Method: Get distinct values of specified field

In [30]:
sandbox_store.distinct(field="batt_id")

['0_Mg', 'spinelTi2S4_Mg']

### Store `remove_docs()` Method

In [31]:
sandbox_store.remove_docs({"batt_id":"spinelTi2S4_Mg"})

In [32]:
sandbox_store.count()

1

## Builder Basics: Convert "Source" Store to "Target" Store

[copied from maggma's documentation...]

Builders represent a data processing step. Builders break down each transformation into 3 phases: `get_items`, `process_item`, and `update_targets`:

1. `get_items`: Retrieve items from the source Store(s) for processing by the next phase
2. `process_item`: Manipulate the input item and create an output document that is sent to the next phase for storage.
3. `update_targets`: Add the processed item to the target Store(s).
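
The three phases can be sketched in plain Python to show how data flows from source to target. This is only an illustration of the pattern, not maggma's implementation: the list `source`, the dict `target`, and the placeholder `wf_uuid` values are assumptions made for the example.

```python
def get_items(source, query):
    """Phase 1: lazily yield source documents that match the query."""
    for doc in source:
        if all(doc.get(field) == value for field, value in query.items()):
            yield doc


def process_item(item):
    """Phase 2: build the output document (here: a copy without "_id")."""
    return {k: v for k, v in item.items() if k != "_id"}


def update_targets(target, items, key="wf_uuid"):
    """Phase 3: write processed documents to the target, indexed by key."""
    for item in items:
        target[item[key]] = item


# Made-up source documents
source = [
    {"_id": 1, "wf_uuid": "uuid-A", "batt_id": "spinelTi2S4_Mg"},
    {"_id": 2, "wf_uuid": "uuid-B", "batt_id": "spinelTi2S4_Mg"},
    {"_id": 3, "wf_uuid": "uuid-C", "batt_id": "0_Mg"},
]
target = {}

query = {"batt_id": "spinelTi2S4_Mg"}
processed = [process_item(item) for item in get_items(source, query)]
update_targets(target, processed)

print(sorted(target))
```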

### Getting a Builder

Note: source collection only requires **read-only** access but the target collection requires **read-write** access

In [33]:
from maggma.builders.map_builder import MapBuilder, CopyBuilder

In [34]:
# for builders derived from MapBuilder, it is important to check the store keys
# MapBuilder relies on the key field for comparing docs in the source and target store
aneb_store.key = "wf_uuid"
sandbox_store.key = "wf_uuid"

In [35]:
query = {"batt_id":"spinelTi2S4_Mg"} # specify which docs from source store to grab
builder = CopyBuilder(source=aneb_store, target=sandbox_store, query=query)

In [36]:
aneb_store.count(query)

2

In [37]:
sandbox_store.count(query)

0

### Builder `get_items()` Method

The number of items should match the count from the source store above

In [38]:
items = builder.get_items()
print(len(items))

TypeError: object of type 'generator' has no len()

In [39]:
items = list(builder.get_items())
print(type(items))
print(len(items))

<class 'list'>
2


In [40]:
for i in items:
    print(i[aneb_store.key],i.keys())

7e319c1a-175e-4414-94a6-eeef009707a0 dict_keys(['_id', 'wf_uuid', 'host', 'end_points', 'acr_notes', 'batt_id', 'source_mp_id', 'tags', 'last_updated', 'pathfinder', 'images'])
83cb0a92-3be7-4e14-9088-5a447fd90e87 dict_keys(['_id', 'wf_uuid', 'host', 'end_points', 'acr_notes', 'batt_id', 'source_mp_id', 'tags', 'last_updated', 'pathfinder', 'images'])


### Builder `process_item()` Method

In [41]:
item_for_target = builder.process_item(items[0])

In [42]:
print(item_for_target.keys())

dict_keys(['wf_uuid', 'last_updated', '_process_time', 'host', 'end_points', 'acr_notes', 'batt_id', 'source_mp_id', 'tags', 'pathfinder', 'images', 'state'])


### Builder `update_targets()` Method

In [43]:
builder.update_targets([item_for_target])

In [44]:
sandbox_store.count(query)

1

### Running a Builder for Testing: `run()` Method

In [45]:
builder.run()

0it [00:00, ?it/s]

2022-03-31 16:50:59,587 - CopyBuilder - INFO - Starting CopyBuilder Builder
2022-03-31 16:50:59,643 - CopyBuilder - INFO - Processing 1 items
2022-03-31 16:50:59,825 - CopyBuilder - INFO - Processing batch of 1000 items
2022-03-31 16:50:59,825 - CopyBuilder - DEBUG - Processing: 83cb0a92-3be7-4e14-9088-5a447fd90e87


In [46]:
sandbox_store.count(query)

2
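
The log reports `Processing 1 items` because one of the two matching documents had already been copied to the target in the previous step. As a rough sketch of the idea (and not maggma's actual code), an incremental builder can skip source documents whose key already exists in the target with an equally recent `last_updated` value. The function `keys_to_process`, the placeholder keys, and the dates are hypothetical.

```python
from datetime import datetime


def keys_to_process(source_docs, target_docs, key="wf_uuid", lu_field="last_updated"):
    """Return keys that are missing from the target or newer in the source."""
    target_lu = {doc[key]: doc[lu_field] for doc in target_docs}
    return [
        doc[key]
        for doc in source_docs
        if doc[key] not in target_lu or doc[lu_field] > target_lu[doc[key]]
    ]


# Made-up documents: uuid-A was already copied, uuid-B was not
source_docs = [
    {"wf_uuid": "uuid-A", "last_updated": datetime(2021, 5, 7)},
    {"wf_uuid": "uuid-B", "last_updated": datetime(2021, 5, 7)},
]
target_docs = [
    {"wf_uuid": "uuid-A", "last_updated": datetime(2021, 5, 7)},
]

pending = keys_to_process(source_docs, target_docs)
print(pending)
```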

## Writing a Custom Builder

`MapBuilder` is very helpful when you want to perform the same action on every document selected from your source store. This can be accomplished by customizing the `unary_function()` method.

https://github.com/materialsproject/maggma/blob/main/src/maggma/builders/map_builder.py#L208

In [47]:
class MyBuilder(MapBuilder):
    def unary_function(self,item):
        item_for_target = {"wf_uuid":item["wf_uuid"], # important to keep key field
                           "note":"development",
                           "bid":item["batt_id"] # let's try renaming the batt_id field
                          }
        return item_for_target
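
Because `unary_function()` only turns one dictionary into another, the mapping logic can be tried out on a hand-written item before any store is involved. A small sketch, where `rename_fields` and the sample item are hypothetical stand-ins:

```python
def rename_fields(item):
    """Same transformation as the unary_function above, written as a plain function."""
    return {
        "wf_uuid": item["wf_uuid"],  # keep the key field
        "note": "development",
        "bid": item["batt_id"],      # batt_id is renamed to bid
    }


# Made-up item with a placeholder wf_uuid
sample_item = {
    "wf_uuid": "uuid-A",
    "batt_id": "spinelTi2S4_Mg",
    "tags": ["20191122_batch"],
}

result = rename_fields(sample_item)
print(result)
```

Checking the output of a single item this way makes it easier to spot renamed or missing fields before running the builder.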

In [48]:
my_builder = MyBuilder(source=aneb_store, target=sandbox_store, query=query)

Before running our custom builder, we need to remove documents from our target store

In [49]:
print(sandbox_store.count(query))
sandbox_store.remove_docs(query)
print(sandbox_store.count(query))

2
0


In [50]:
my_builder.run()

0it [00:00, ?it/s]

2022-03-31 16:51:58,988 - MyBuilder - INFO - Starting MyBuilder Builder
2022-03-31 16:51:59,163 - MyBuilder - INFO - Processing 2 items
2022-03-31 16:51:59,471 - MyBuilder - INFO - Processing batch of 1000 items
2022-03-31 16:51:59,471 - MyBuilder - DEBUG - Processing: 7e319c1a-175e-4414-94a6-eeef009707a0
2022-03-31 16:51:59,471 - MyBuilder - DEBUG - Processing: 83cb0a92-3be7-4e14-9088-5a447fd90e87


In [51]:
print(sandbox_store.count(query))

0


In [None]:
# our previous query needs to be modified because the custom builder renamed the batt_id field as bid
print(sandbox_store.count({"bid":"spinelTi2S4_Mg"}))

# Onboarding Independent Exercises

1. How many total documents are in the `fw_acr_mv/approx_neb` collection?

2. How many unique `batt_id` values are there for documents in the `fw_acr_mv/approx_neb` collection? How many unique `wf_uuid` values? 

3. Based on your answers to #1 and #2, what may be the best key field for the `fw_acr_mv/approx_neb` collection? Why?

4. Obtain the unique `tags` values that contain the string `"batch"`. Then, count the number of documents matching these tags values (e.g. 18 docs match "20191122_batch", 25 docs match "20191127_batch", etc.). (Hint: try the MongoDB query operators `"$in"` or `"$all"` for finding documents where a list contains a certain value: https://docs.mongodb.com/manual/reference/operator/query/)

5. Develop your own custom version of the MapBuilder and test out running it to check if it works as expected.

### 1. How many total documents are in the `fw_acr_mv/approx_neb` collection?

In [52]:
aneb_store.count()

214

### 2. How many unique `batt_id` values are there for documents in the `fw_acr_mv/approx_neb` collection? How many unique `wf_uuid` values? 

In [53]:
len(aneb_store.distinct(field="batt_id"))

199

In [54]:
len(aneb_store.distinct(field="wf_uuid"))

214

### 3. Based on your answers to #1 and #2, what may be the best key field for the `fw_acr_mv/approx_neb` collection? Why?

The collection holds 214 docs but only 199 distinct `batt_id` values, so some approx_neb docs share the same `batt_id`. Every doc has a distinct `wf_uuid` (214 values), which makes `wf_uuid` the better key field for distinguishing individual docs, as is important for MapBuilder.

However, the batt_id field may be a more appropriate key if we are interested in analysis for a given battery electrode. The best choice for key depends on the context.
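
To see which values of a candidate key field are shared by several docs, the values can be tallied. The sketch below uses made-up documents and a hypothetical helper `duplicated_values`; it is meant as an illustration rather than part of maggma.

```python
from collections import Counter


def duplicated_values(docs, field):
    """Return {value: count} for values of `field` that occur more than once."""
    counts = Counter(doc[field] for doc in docs)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up documents with placeholder wf_uuid values
sample_docs = [
    {"batt_id": "spinelTi2S4_Mg", "wf_uuid": "uuid-A"},
    {"batt_id": "spinelTi2S4_Mg", "wf_uuid": "uuid-B"},
    {"batt_id": "0_Mg", "wf_uuid": "uuid-C"},
]

batt_id_duplicates = duplicated_values(sample_docs, "batt_id")
wf_uuid_duplicates = duplicated_values(sample_docs, "wf_uuid")

print(batt_id_duplicates)
print(wf_uuid_duplicates)
```

With a connected store, the input could come from something like `list(aneb_store.query({}, properties=["batt_id", "wf_uuid"]))`, which limits the returned fields to the ones being checked.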

### 4. Obtain the unique `tags` values that contain the string `"batch"`. Then, count the number of documents matching these tags values (e.g. 18 docs match "20191122_batch", 25 docs match "20191127_batch", etc.). 

(Hint: try the MongoDB query operators `"$in"` or `"$all"` for finding documents where a list contains a certain value: https://docs.mongodb.com/manual/reference/operator/query/)

In [55]:
batch_values = [i for i in aneb_store.distinct("tags") if "batch" in i]
batch_values

['20191122_batch',
 '20191127_batch',
 '20200203_batch',
 '20200214_batch',
 '20200406_batch',
 '20200410_batch',
 '20200603_batch',
 '20201106_batch',
 '20201109_batch',
 '20210215_batch',
 '20210511_batch',
 '20210712_batch',
 '20210922_batch',
 '20211109_batch']
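
Note that `distinct("tags")` returns individual strings even though `tags` is a list in each document, since MongoDB treats every array element as a separate value. A plain-Python model of that flattening is sketched below; `distinct_values` and the sample tags are hypothetical.

```python
def distinct_values(docs, field):
    """Collect unique values of `field`, treating list elements as separate values."""
    values = set()
    for doc in docs:
        if field not in doc:
            continue
        value = doc[field]
        if isinstance(value, list):
            values.update(value)
        else:
            values.add(value)
    return sorted(values)


# Made-up documents; "example_tag" is a placeholder
sample_docs = [
    {"tags": ["20191122_batch", "example_tag"]},
    {"tags": ["20191127_batch"]},
    {"tags": ["20191122_batch"]},
    {"batt_id": "0_Mg"},  # no tags field at all
]

all_tags = distinct_values(sample_docs, "tags")
batch_tags = [tag for tag in all_tags if "batch" in tag]

print(all_tags)
print(batch_tags)
```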

In [56]:
for b in batch_values:
    print(b, aneb_store.count({"tags":{"$all":[b]}}))

20191122_batch 18
20191127_batch 25
20200203_batch 3
20200214_batch 70
20200406_batch 25
20200410_batch 9
20200603_batch 8
20201106_batch 31
20201109_batch 12
20210215_batch 2
20210511_batch 2
20210712_batch 4
20210922_batch 1
20211109_batch 2
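
With a single value in the list, `"$all"` and `"$in"` select the same documents; they differ once several values are given. `"$all"` requires the array field to contain every listed value, while `"$in"` requires at least one. The following plain-Python model of the two operators uses hypothetical helper functions and made-up tags.

```python
def matches_all(doc, field, values):
    """Model of {"$all": values}: the array contains every listed value."""
    return set(values).issubset(doc.get(field, []))


def matches_in(doc, field, values):
    """Model of {"$in": values} on an array field: at least one value is present."""
    return not set(values).isdisjoint(doc.get(field, []))


# Made-up documents; "example_tag" is a placeholder
sample_docs = [
    {"tags": ["20191122_batch", "example_tag"]},
    {"tags": ["20191127_batch"]},
    {"tags": ["20191122_batch", "20191127_batch"]},
]

single = ["20191122_batch"]
both = ["20191122_batch", "20191127_batch"]

n_all_single = sum(matches_all(d, "tags", single) for d in sample_docs)
n_in_single = sum(matches_in(d, "tags", single) for d in sample_docs)
n_all_both = sum(matches_all(d, "tags", both) for d in sample_docs)
n_in_both = sum(matches_in(d, "tags", both) for d in sample_docs)

print(n_all_single, n_in_single, n_all_both, n_in_both)
```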


### 5. Develop your own custom version of the MapBuilder and test out running it to check if it works as expected.

In [57]:
target_store = MongoStore(database= "local_dev",collection_name= "custom_builder")
target_store.connect()

In [58]:
from pymatgen.core import Structure

In [59]:
class MyBuilder(MapBuilder):
    def unary_function(self,item):
        """
        store the wf_uuid field, rename the batt_id field as bid, store the host composition
        """
        item_for_target = {
            "wf_uuid":item["wf_uuid"],
            "bid":item["batt_id"],
        }
        struct = Structure.from_dict(item["host"]["input_structure"])
        item_for_target.update({"host_composition":struct.composition})
        return item_for_target

In [60]:
custom_builder = MyBuilder(source=aneb_store, target=target_store, query=query)

In [61]:
custom_builder.run()

0it [00:00, ?it/s]

2022-03-31 16:52:27,543 - MyBuilder - INFO - Starting MyBuilder Builder
2022-03-31 16:52:27,968 - MyBuilder - INFO - Processing 2 items
2022-03-31 16:52:28,530 - MyBuilder - INFO - Processing batch of 1000 items
2022-03-31 16:52:28,530 - MyBuilder - DEBUG - Processing: 7e319c1a-175e-4414-94a6-eeef009707a0
2022-03-31 16:52:28,534 - MyBuilder - DEBUG - Processing: 83cb0a92-3be7-4e14-90