# Joint Undergraduate Researcher Onboarding #5
**Topic:** Introduction to maggma to ease working with MongoDB databases

**Date:** April 4, 2022

**Prepared by:** Ann Rutt

# Outline & Relevant Documentation

Onboarding Session Demonstration:
* Getting Stores: Connecting to your Database
* Store `key` Attribute: Specify field for distinguishing documents
* Store `query_one()` and `query()` Methods: Accessing database docs
* Store `update()` Method: Storing database docs
* Store `count()` Method
* Store `distinct()` Method: Get distinct values of specified field
* Store `remove_docs()` Method
* Getting a Builder
* Builder `get_items()` Method
* Builder `process_item()` Method
* Builder `update_targets()` Method
* Running a Builder for Testing: `run()` Method
* Writing a Custom Builder

Onboarding Independent Exercises Goals:
* Become familiar with the ApproxNEB data stored in `fw_acr_mv/approx_neb`
* Apply MongoDB querying and learn about more advanced querying techniques (https://docs.mongodb.com/manual/reference/operator/query/)
* Practice using common store methods (`query`, `count`, `distinct`)
* Try out developing and testing a builder by creating a custom MapBuilder

For further reading and general reference...
* Maggma Documentation: https://materialsproject.github.io/maggma/
* Maggma GitHub Repo: https://github.com/materialsproject/maggma

# Set-up Check: Mongogrant Credentials

Run the code in this section on your own before the onboarding session to check that everything works correctly with your `.mongogrant.json` file, which should be located in your home directory.

**You need to be connected to the LBL VPN to successfully run this code.**

In [1]:
from maggma.stores.advanced_stores import MongograntStore
aneb_store = MongograntStore("ro:mongodb07-ext.nersc.gov/fw_acr_mv", "approx_neb")
aneb_store.connect()
aneb_store.query_one({})


{'_id': ObjectId('5dbcecc3343e0926a3317022'),
 'last_updated': datetime.datetime(2016, 9, 21, 17, 1, 33, 101000),
 'host': {'dir_name': 'nid02991:/global/projecta/projectdirs/matgen/acrutt/block_2019-08-18-15-46-52-604494/launcher_2019-11-02-00-05-37-192454',
  'formula_pretty': 'FePO4',
  'input_structure': {'@module': 'pymatgen.core.structure',
   '@class': 'Structure',
   'charge': None,
   'lattice': {'matrix': [[10.007105, 8.8e-05, 6.5e-05],
     [0.000104, 11.868464, -0.000266],
     [6.4e-05, -0.00022, 9.786512]],
    'a': 10.007105000598,
    'b': 11.8684640034365,
    'c': 9.78651200268206,
    'alpha': 90.0025721334633,
    'beta': 89.99925316096,
    'gamma': 89.9989940967199,
    'volume': 1162.33390474949},
   'sites': [{'species': [{'element': 'Fe', 'occu': 1}],
     'abc': [0.775093, 0.124994, 0.26611],
     'xyz': [7.756467066181, 1.4834964532, 2.604305840961],
     'label': 'Fe',
     'properties': {}},
    {'species': [{'element': 'Fe', 'occu': 1}],
     'abc': [0.775

The expected output is a long doc print-out. Reach out for help if you run into any errors or issues here so we can troubleshoot individually.

# Onboarding Session Demonstration

If maggma has not been installed yet, run this command in a terminal:

`pip install maggma`

In [6]:
# maggma is updated frequently, so it is important to check which version you are using
import maggma
print(maggma.__version__)
print(maggma.__path__)

ModuleNotFoundError: No module named 'maggma'

In [None]:
# pprint (pretty print) can be helpful for making dictionaries more readable when printed 
from pprint import pprint

We will be using database credentials stored in `.mongogrant.json` to give you read-only access to Ann's database hosted on NERSC so you can practice working with relevant data.

**Place a copy of the provided `.mongogrant.json` file in your home directory**

## Store Basics

### Getting Stores: Connecting to your Database

A `MongoStore` object connects to your local MongoDB database

In [None]:
from maggma.stores.mongolike import MongoStore

In [None]:
sandbox_store = MongoStore(database="local_dev", collection_name="sandbox")
sandbox_store.connect()

A `MongograntStore` object connects to MongoDB databases (typically hosted on NERSC) with credentials stored in your `.mongogrant.json` file

In [None]:
from maggma.stores.advanced_stores import MongograntStore

In [None]:
# note: "ro:" specifies read-only access, "rw:" specifies read-write access
aneb_store = MongograntStore("ro:mongodb07-ext.nersc.gov/fw_acr_mv", "approx_neb")
aneb_store.connect()

Advanced Tip: A `ConcatStore` can be used to combine multiple stores into a single store object

In [None]:
from maggma.stores.compound_stores import ConcatStore

In [None]:
# store_1 and store_2 are placeholders for Store objects you have already created
combined_store = ConcatStore([store_1, store_2])

### Store `key` Attribute: Specify field for distinguishing documents

### Store `query_one()` and `query()` Methods: Accessing database docs

### Store `update()` Method: Storing database docs

Note: Updating stores requires read-write access

In [None]:
# check that your sandbox store is empty


In [None]:
# sandbox store is no longer empty


In [None]:
# let's try storing the 2 aneb docs from earlier
# must delete the unique doc identifier "_id" field first


In [None]:
# let's check that all docs are stored - expecting 1+2 docs


In [None]:
# we did not successfully store the 2 aneb docs
# the issue is we were missing unique values for the batt_id key field


In [None]:
# let's change to a more unique key field and try again


In [None]:
# let's check that all docs are stored - should get 3 docs now
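
The failed update in the cells above comes down to how the key field is used. Below is a minimal plain-Python model, assuming that an update behaves like an upsert keyed on the store's key field and that the two docs shared a `batt_id` value. The `toy_update` helper and the sample values are hypothetical and are not maggma code.

```python
def toy_update(collection, docs, key):
    """Upsert docs into a dict-backed collection, replacing any doc with the same key value."""
    for doc in docs:
        collection[doc[key]] = doc


# Two made-up docs that share a batt_id but have different wf_uuid values
new_docs = [
    {"batt_id": "X", "wf_uuid": "uuid-1"},
    {"batt_id": "X", "wf_uuid": "uuid-2"},
]

by_batt_id = {}
toy_update(by_batt_id, new_docs, key="batt_id")
print(len(by_batt_id))  # the second doc replaced the first

by_wf_uuid = {}
toy_update(by_wf_uuid, new_docs, key="wf_uuid")
print(len(by_wf_uuid))  # both docs are kept
```

With a key whose values repeat, later docs silently overwrite earlier ones, which is why switching to a more unique key field lets every doc be stored.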


### Store `count()` Method

### Store `distinct()` Method: Get distinct values of specified field

### Store `remove_docs()` Method

## Builder Basics: Convert "Source" Store to "Target" Store

[copied from maggma's documentation...]

Builders represent a data processing step. Builders break down each transformation into 3 phases: `get_items`, `process_item`, and `update_targets`:

1. `get_items`: Retrieve items from the source Store(s) for processing by the next phase
2. `process_item`: Manipulate the input item and create an output document that is sent to the next phase for storage.
3. `update_targets`: Add the processed item to the target Store(s).
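
The three phases can be pictured with a plain-Python sketch. This is not maggma's implementation: `ToyBuilder`, its list-based source, its dict-based target, and the `formula_length` field are hypothetical stand-ins used only to show how data flows from one phase to the next.

```python
class ToyBuilder:
    """Hypothetical stand-in for a builder; not part of maggma."""

    def __init__(self, source, target, key="wf_uuid"):
        self.source = source  # list of dicts playing the role of the source store
        self.target = target  # dict keyed by the key field, playing the role of the target store
        self.key = key

    def get_items(self):
        # Phase 1: yield documents from the source one at a time
        for doc in self.source:
            yield doc

    def process_item(self, item):
        # Phase 2: build the output document from a single input item
        return {self.key: item[self.key], "formula_length": len(item["formula"])}

    def update_targets(self, items):
        # Phase 3: write processed documents to the target, keyed by the key field
        for doc in items:
            self.target[doc[self.key]] = doc

    def run(self):
        # Chain the three phases together
        processed = [self.process_item(item) for item in self.get_items()]
        self.update_targets(processed)


source = [{"wf_uuid": "a1", "formula": "FePO4"}, {"wf_uuid": "b2", "formula": "TiS2"}]
target = {}
ToyBuilder(source, target).run()
print(target)
```

Keeping `process_item` free of any reads or writes is what makes it easy to test on a single document before running the whole builder.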

### Getting a Builder

Note: the source collection only requires **read-only** access, but the target collection requires **read-write** access

In [None]:
from maggma.builders.map_builder import MapBuilder, CopyBuilder

In [None]:
# for builders derived from MapBuilder, it is important to check the store keys
# MapBuilder relies on the key field for comparing docs in the source and target store
aneb_store.key = "wf_uuid"
sandbox_store.key = "wf_uuid"

In [None]:
query = {"batt_id":"spinelTi2S4_Mg"} # specify which docs from source store to grab
builder = CopyBuilder(source=aneb_store, target=sandbox_store, query=query)

In [None]:
aneb_store.count(query)

In [None]:
sandbox_store.count(query)

### Builder `get_items()` Method

The number of items should match the count from the source store above

In [None]:
items = list(builder.get_items())  # get_items() yields docs lazily, so collect them into a list before counting
print(len(items))

### Builder `process_item()` Method

### Builder `update_targets()` Method

### Running a Builder for Testing: `run()` Method

## Writing a Custom Builder

`MapBuilder` is very helpful when you want to perform the same action on every document from your source store. This can be accomplished by overriding the `unary_function()` method.

https://github.com/materialsproject/maggma/blob/main/src/maggma/builders/map_builder.py#L208

In [None]:
class MyBuilder(MapBuilder):
    def unary_function(self,item):
        item_for_target = {"wf_uuid":item["wf_uuid"], # important to keep key field
                           "note":"development",
                           "bid":item["batt_id"] # let's try renaming the batt_id field
                          }
        return item_for_target

In [None]:
my_builder = MyBuilder(source=aneb_store, target=sandbox_store, query=query)

Before running our custom builder, we need to remove documents from our target store
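
A likely reason for clearing the target first: builders derived from `MapBuilder` are incremental, so a source doc whose key already exists in the target with an equally recent `last_updated` value would be skipped. The plain-Python sketch below models that selection. The `keys_to_process` helper and the sample docs are hypothetical, and maggma's actual logic may differ in detail.

```python
from datetime import datetime


def keys_to_process(source_docs, target_docs, key="wf_uuid", lu_field="last_updated"):
    """Return key values whose source doc is missing from, or newer than, the target."""
    target_updated = {doc[key]: doc[lu_field] for doc in target_docs}
    return [
        doc[key]
        for doc in source_docs
        if doc[key] not in target_updated or doc[lu_field] > target_updated[doc[key]]
    ]


source_docs = [
    {"wf_uuid": "u1", "last_updated": datetime(2022, 4, 1)},
    {"wf_uuid": "u2", "last_updated": datetime(2022, 4, 1)},
]
# The target already holds an up-to-date copy of u1 from an earlier builder run
populated_target = [{"wf_uuid": "u1", "last_updated": datetime(2022, 4, 1)}]

print(keys_to_process(source_docs, populated_target))  # u1 is skipped
print(keys_to_process(source_docs, []))  # an empty target means everything is processed
```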

In [None]:
print(sandbox_store.count(query))
sandbox_store.remove_docs(query)
print(sandbox_store.count(query))

In [None]:
my_builder.run()

In [None]:
print(sandbox_store.count(query))

# Onboarding Independent Exercises

1. How many total documents are in the `fw_acr_mv/approx_neb` collection?

2. How many unique `batt_id` values are there for documents in the `fw_acr_mv/approx_neb` collection? How many unique `wf_uuid` values? 

3. Based on your answers to #1 and #2, what may be the best key field for the `fw_acr_mv/approx_neb` collection? Why?

4. Obtain the unique `tags` values that contain the string `"batch"`. Then, count the number of documents matching these tags values (e.g. 18 docs match "20191122_batch", 25 docs match "20191127_batch", etc.). (Hint: try the MongoDB query operators `"$in"` or `"$all"` for finding documents where a list contains a certain value: https://docs.mongodb.com/manual/reference/operator/query/)

5. Develop your own custom version of the MapBuilder and test out running it to check if it works as expected.

### 1. How many total documents are in the `fw_acr_mv/approx_neb` collection?

### 2. How many unique `batt_id` values are there for documents in the `fw_acr_mv/approx_neb` collection? How many unique `wf_uuid` values? 

### 3. Based on your answers to #1 and #2, what may be the best key field for the `fw_acr_mv/approx_neb` collection? Why?

There are some cases where multiple approx_neb docs share the same `batt_id`. Every approx_neb doc has a distinct `wf_uuid` value, so `wf_uuid` is the better key field for distinguishing individual docs, which matters for MapBuilder.

However, the batt_id field may be a more appropriate key if we are interested in analysis for a given battery electrode. The best choice for key depends on the context.
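
The uniqueness argument can be checked by comparing the document count with the number of distinct values of each candidate key. The sketch below uses made-up docs in plain Python, and `is_unique_key` is a hypothetical helper. On a connected store, the analogous numbers would come from the `count()` and `distinct()` methods practiced in exercises 1 and 2.

```python
# Made-up docs: two share a batt_id, all have different wf_uuid values
docs = [
    {"wf_uuid": "u1", "batt_id": "A"},
    {"wf_uuid": "u2", "batt_id": "A"},
    {"wf_uuid": "u3", "batt_id": "B"},
]


def is_unique_key(docs, field):
    """Return True if every doc has a different value for `field`."""
    return len({doc[field] for doc in docs}) == len(docs)


print(is_unique_key(docs, "wf_uuid"))
print(is_unique_key(docs, "batt_id"))
```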

### 4. Obtain the unique `tags` values that contain the string `"batch"`. Then, count the number of documents matching these tags values (e.g. 18 docs match "20191122_batch", 25 docs match "20191127_batch", etc.). 

(Hint: try the MongoDB query operators `"$in"` or `"$all"` for finding documents where a list contains a certain value: https://docs.mongodb.com/manual/reference/operator/query/)
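
To make the difference between the two operators concrete, here is a small plain-Python sketch. The `matches` function is a toy evaluator written for this illustration and covers only `$in` and `$all` on list-valued fields; the docs are made up. The criteria dictionaries themselves have the same shape you would pass to a store's `query()` or `count()` method.

```python
# Made-up docs standing in for approx_neb docs
docs = [
    {"wf_uuid": "a", "tags": ["20191122_batch", "test"]},
    {"wf_uuid": "b", "tags": ["20191127_batch"]},
    {"wf_uuid": "c", "tags": ["test"]},
]


def matches(doc, criteria):
    """Evaluate a tiny subset of MongoDB criteria ($in, $all) on list-valued fields."""
    for field, cond in criteria.items():
        values = doc.get(field, [])
        # $in: the list must contain at least one of the given values
        if "$in" in cond and not any(v in values for v in cond["$in"]):
            return False
        # $all: the list must contain every one of the given values
        if "$all" in cond and not all(v in values for v in cond["$all"]):
            return False
    return True


in_query = {"tags": {"$in": ["20191122_batch", "20191127_batch"]}}
all_query = {"tags": {"$all": ["20191122_batch", "test"]}}

print([d["wf_uuid"] for d in docs if matches(d, in_query)])
print([d["wf_uuid"] for d in docs if matches(d, all_query)])
```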

### 5. Develop your own custom version of the MapBuilder and test out running it to check if it works as expected.