# How to organize data with AiiDA

When doing *manual* calculations, one typically uses files and directories to organize data.  
The names of directories, README files, and nested folders allow one to trace progress and pick up where one left off.

One challenge of moving from files and folders to an automated platform is that organizing data and tracking progress become less familiar.  
This notebook showcases how data, workflows, and calculations can be organized efficiently when using AiiDA.


In [1]:
%load_ext aiida

from aiida import load_profile, engine, orm, plugins
from aiida.storage.sqlite_temp import SqliteTempBackend

profile = load_profile(
    SqliteTempBackend.create_profile(
        'myprofile',
        sandbox_path='_sandbox',
        options={
            'warnings.development_version': False,
            'runner.poll.interval': 1
        },
        debug=False
    ),
    allow_switch=True
)
profile

Profile<uuid='7ed9f3f2bc6a45c98f75b6734a829020' name='myprofile'>

In [2]:
from aiida.orm import Data, Node, load_node

## `id`, `uuid` and Node attributes

Data in AiiDA is represented by `Node` instances and links between nodes.
Here we only focus on the former, as the latter are typically automatically generated/interpreted without human intervention.

When doing calculations, there will be input data, e.g. input crystal structures, and output data, e.g. the calculated properties and calculation files.
The calculations themselves, and the associated workflows, are also represented as `Node` objects.

This section showcases how metadata can be attached to a node.

First, generate a simple node instance.

In [3]:
node = Data()

In [4]:
node

<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (unstored)>

In [5]:
isinstance(node, Node)

True

Note that this `Data` node is not stored yet, but a *UUID* has already been assigned to it.

The *UUID* is a unique identifier for the `Node` and will remain the same forever, even if the data is exported and imported into other databases.

In [6]:
node.id

In [7]:
node.pk

The `id` (or `pk`) is also a key for referencing the `Data` node. It is used internally by the database engine for efficient referencing of the same `Node` instance.  
It is only assigned when the node is stored. As you can see, it is empty for now since the `Data` node is not stored yet.

In [8]:
node.attributes

{}

The `attributes` are properties associated with the node, held in a dictionary of key-value pairs.
They can be set by the user if needed.

In [9]:
node.set_attribute("foo", "bar")

In [10]:
node.attributes

{'foo': 'bar'}

In [11]:
node.get_attribute("foo")

'bar'

We now store the `Data` node.

In [12]:
node.store()

<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (pk: 1)>

Stored nodes can be loaded again using their `id` or `uuid`.

In [13]:
node2 = load_node(node.id)
assert node2.id == node.id
assert node2.uuid == node.uuid
assert node2.attributes == node.attributes
node2 

<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (pk: 1)>

In [14]:
hash(node2) == hash(node)

True

Loading with `uuid` (here only the first 8 characters are given, which is enough as long as the prefix is unique)

In [15]:
load_node(node.uuid[:8])

<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (pk: 1)>
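Why a UUID prefix can be enough is easier to see in a plain-Python sketch. The function `resolve_uuid_prefix` below is hypothetical and is not how AiiDA implements `load_node`; it only illustrates that a prefix works as an identifier as long as exactly one known UUID starts with it. The second UUID in the list is made up to provoke a clash.

```python
def resolve_uuid_prefix(prefix, known_uuids):
    """Return the single UUID starting with `prefix`, or raise if none or several match."""
    matches = [candidate for candidate in known_uuids if candidate.startswith(prefix)]
    if not matches:
        raise LookupError(f"no UUID starts with {prefix!r}")
    if len(matches) > 1:
        raise LookupError(f"{len(matches)} UUIDs start with {prefix!r}; give more characters")
    return matches[0]


known_uuids = [
    "7337d4d8-589b-4478-8951-948b7fa053fa",  # UUID of the node stored above
    "7337ffff-0000-4000-8000-000000000000",  # made-up UUID, only here to create a clash
]

# Eight characters single out the first UUID; the shorter prefix "7337" would be ambiguous.
print(resolve_uuid_prefix("7337d4d8", known_uuids))
```

In practice, the more nodes a database holds, the longer the prefix needs to be to stay unique.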

Finding the same node by querying: give me all the nodes that contain `{'foo': 'bar'}` in their attributes

In [16]:
Data.objects.find(filters={'attributes.foo': 'bar'})

[<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (pk: 1)>]

Difference between `uuid` and `id`:

* `id` is used *internally* by the database engine, and it is unique **only within this database**. The same `id` can refer to different nodes in different databases.
* `uuid` is a unique identifier and will always remain unchanged, even if the data is imported into other databases.
* `id` is only assigned when the node is stored.
* `uuid` is assigned when the node is created.
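
The difference can be mimicked with a toy model in plain Python. This is only an analogy under stated assumptions: `ToyDatabase` is a hypothetical class, not part of AiiDA, in which the `pk` is a per-database counter while the UUID is created up front and travels with the record.

```python
import uuid


class ToyDatabase:
    """Toy stand-in for a database: pks are local counters, UUIDs come with the record."""

    def __init__(self):
        self._next_pk = 1
        self.by_pk = {}

    def store(self, record_uuid):
        """Store a record and return the pk assigned by this database."""
        pk = self._next_pk
        self._next_pk += 1
        self.by_pk[pk] = record_uuid
        return pk


# UUIDs exist as soon as the records are created, before anything is stored.
first_uuid = str(uuid.uuid4())
second_uuid = str(uuid.uuid4())

db_a = ToyDatabase()
db_b = ToyDatabase()

pk_in_a = db_a.store(first_uuid)
db_b.store(second_uuid)            # db_b already holds another record ...
pk_in_b = db_b.store(first_uuid)   # ... so the "imported" record gets a different pk

# pk 1 exists in both databases but refers to different records.
print(db_a.by_pk[1] == db_b.by_pk[1])
# The UUID still identifies the same record in both databases.
print(db_a.by_pk[pk_in_a] == db_b.by_pk[pk_in_b])
```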

Directly referencing the `id` or `uuid` allows the same stored data to be loaded later for analysis or for launching calculations.

However, while the `uuid` allows *unambiguous* reference to the same data, remembering its value is impossible, and writing it down somewhere (in a notebook) can be tedious.

Fortunately, there is an array of tools for adding metadata to `Node` instances so that they can be referenced flexibly later.

## `mtime` and `ctime`

Timestamps are useful to know when the data was created or modified.

Time of creation:

In [17]:
node.ctime.isoformat()

'2022-05-25T12:24:05.011425+00:00'

Time of modification:

In [18]:
node.mtime.isoformat()

'2022-05-25T12:24:05.104282+00:00'

## `label` and `description`

`label`, as its name suggests, allows us to label the node. Both `label` and `description` are empty by default, and can be assigned *before or after* the node is stored.

In [19]:
node.label

''

In [20]:
node.description

''

In [21]:
node.label = 'materials1-run1-input-data'
node.description = 'Some data for my calculation / analysis'

In [22]:
node.label

'materials1-run1-input-data'

In [23]:
node.description

'Some data for my calculation / analysis'

❗🚀One can also use `label` to reference the node

In [24]:
load_node('materials1-run1-input-data')

<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (pk: 1)>

😝However, this obviously does not work if there are nodes with the same label...

In [25]:
node2 = Data()
node2.label = 'materials1-run1-input-data'
node2.store()

<Data: uuid: fe6bc62b-cd40-4c16-b7ae-28585f7e1413 (pk: 2)>

In [26]:
load_node('materials1-run1-input-data')

MultipleObjectsError: multiple Node entries found with LABEL<materials1-run1-input-data>

😄To get us out of this situation, one can query to find all nodes with the same label

In [27]:
Data.objects.find(filters={'label': 'materials1-run1-input-data'})

[<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (pk: 1)>,
 <Data: uuid: fe6bc62b-cd40-4c16-b7ae-28585f7e1413 (pk: 2)>]

or to get (any) one of them

In [28]:
Data.objects.query(filters={'label': 'materials1-run1-input-data'}).first()

[<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (pk: 1)>]

It is also possible to query by the `description`. In fact, almost all of the data stored in `Node` instances can be queried (a nice thing when your data is stored in a database!⭐)

In [29]:
Data.objects.find(filters={'description': 'Some data for my calculation / analysis'})

[<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (pk: 1)>]

or with pattern matching

In [30]:
Data.objects.find(filters={'description': {'like': 'Some data%'}})

[<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (pk: 1)>]
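
The `%` in the filter above is the wildcard of the SQL `LIKE` operator. As a rough, database-level illustration (not AiiDA code), the sketch below runs the same kind of pattern against an in-memory SQLite table using Python's standard `sqlite3` module. The table `toy_nodes` and the third description are hypothetical, and SQLite's `LIKE` is case-insensitive for ASCII by default, which may differ from the backend used by AiiDA.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE toy_nodes (pk INTEGER PRIMARY KEY, description TEXT)")
connection.executemany(
    "INSERT INTO toy_nodes (description) VALUES (?)",
    [
        ("Some data for my calculation / analysis",),  # pk 1
        ("",),                                         # pk 2, empty description
        ("Other data",),                               # pk 3, hypothetical entry
    ],
)


def pks_matching(pattern):
    """Return the pks whose description matches the SQL LIKE pattern."""
    rows = connection.execute(
        "SELECT pk FROM toy_nodes WHERE description LIKE ? ORDER BY pk", (pattern,)
    ).fetchall()
    return [row[0] for row in rows]


# '%' matches any run of characters, including an empty one.
print(pks_matching("Some data%"))  # descriptions starting with "Some data"
print(pks_matching("%data%"))      # descriptions containing "data" anywhere
```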

❗💡While descriptions and labels are very powerful for adding metadata to nodes, keep in mind that they can be changed at any time.  
If you build your analysis code purely on references by `label` and `description`, it can break easily.

## Use `extras` to store additional (meta)data
Once a node is stored, its `attributes` cannot be changed (in most cases).

In [31]:
node.set_attribute('ha', 42)

ModificationNotAllowed: the attributes of a stored entity are immutable

However, the `extras` field can be changed, just like the `label` and the `description`.

In [32]:
node.set_extra('ha', 42)

In [33]:
node.extras

{'_aiida_hash': 'f29c067972a92842ec9f252cc103bcbc08dafaf462277c4dbf02d0f69b28cf64',
 'ha': 42}

and it can be used for querying just like the others

In [34]:
Data.objects.find(filters={'extras.ha': 42})

[<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (pk: 1)>]

In some cases, it can be useful to *tag* nodes this way. For example, crystal structures imported from the Materials Project can be tagged with the `entry_id` provided.

See also: `aiida_user_addons.pymatgen.load_structure_from_mp`.

## `comments` for commenting

One can also add comments to a node. Unlike `label` and `description`, there can be multiple comments associated with the same node.

In [35]:
node.add_comment("My awsome input parameters!")

<aiida.orm.comments.Comment at 0x7fc7157f94f0>

In [36]:
node.get_comments()

[<aiida.orm.comments.Comment at 0x7fc7157ab550>]

Note that comments are also stored in the database.

In [37]:
node.get_comments()[0].uuid, node.get_comments()[0].content

('ec1f0019-6bdd-4db9-823c-ff1535074944', 'My awsome input parameters!')

This means that they can be queried

In [38]:
from aiida.orm import Comment

In [39]:
comments = Comment.objects.find(filters={'content': {'like': '%awsome%'}})
comments

[<aiida.orm.comments.Comment at 0x7fc7157bdb80>]

In [40]:
comments[0]

<aiida.orm.comments.Comment at 0x7fc7157bdb80>

## ⭐ Groups

Groups are containers of `Node`s. They have an `id` and a `uuid` as well, but it is common to refer to them by their `label`, as there are typically only a few groups.

In [41]:
from aiida.orm import Group

Create a group called `mygroup`, if there isn't one already. If there is one, just return it.

In [42]:
group = Group.objects.get_or_create(label="mygroup")[0]

Add nodes to this group

In [43]:
group.add_nodes(node)

List all nodes in the group

In [44]:
list(group.nodes)

[<Data: uuid: 7337d4d8-589b-4478-8951-948b7fa053fa (pk: 1)>]

Nodes inside the group can be iterated

In [45]:
for node in group.nodes:
    print(node.label)

materials1-run1-input-data


Remove nodes from the group

In [46]:
group.remove_nodes(node)

In [47]:
list(group.nodes)

[]

## `GroupPathX`

`GroupPathX` extends the `GroupPath` as implemented in `aiida-core`. The latter allows nesting a `Group` inside a `Group` virtually (not in the database).
Given a group with label `mygroup` and a second group with label `mygroup/subgroup`, the latter is treated as if it is nested under the former.
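
The idea of virtual nesting can be sketched in a few lines of plain Python. This is a simplified illustration, not the `GroupPath` implementation: the helper names `build_tree` and `is_virtual` are hypothetical, and the label `project/run1` is made up. The hierarchy is derived purely from the `/` separators in the labels, and a path counts as virtual when no group carries exactly that label.

```python
def build_tree(labels, separator="/"):
    """Turn flat, slash-separated labels into a nested dictionary."""
    tree = {}
    for label in labels:
        branch = tree
        for part in label.split(separator):
            # Create intermediate levels on the fly, even if no group has that label.
            branch = branch.setdefault(part, {})
    return tree


def is_virtual(path, labels):
    """A path is virtual when no group carries exactly this label."""
    return path not in labels


# Labels of the groups that "exist"; "project/run1" is a hypothetical example.
labels = {"mygroup", "mygroup/subgroup", "project/run1"}

tree = build_tree(sorted(labels))
print(tree)
print(is_virtual("project", labels))           # no group is labelled "project"
print(is_virtual("mygroup/subgroup", labels))  # a group with this label exists
```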

`GroupPathX` allows `Node`s contained inside a `Group` to be referenced similarly.   
For example, `mygroup/subgroup/mynode` can be used to reference the `Node` with the *alias* `mynode` stored in the group with label `mygroup/subgroup`.  

The *alias* is another property of the `Node`, associated with the group `mygroup/subgroup`.  
Each node can have multiple *aliases*, each targeting a different group.

This is analogous to the filesystem:

- `Group` -> directory
- `Node` with alias -> files

In [48]:
from aiida_grouppathx import GroupPathX, decorate_node, decorate_with_uuid, decorate_with_exit_status

In [49]:
group1 = GroupPathX("mygroup")
group2 = GroupPathX("mygroup/subgroup")

In [50]:
group2.is_virtual

True

There is no group called `mygroup/subgroup` yet... We create it now!

In [51]:
group2.get_or_create_group()

(<Group: 'mygroup/subgroup' [type core], of user user@email.com>, True)

In [52]:
group2.is_virtual

False

In [53]:
group2.is_group

True

List the content of `mygroup`

In [54]:
group1.show_tree()

mygroup
└── subgroup



Let's add some nodes that are "named"

In [55]:
group1.add_node(node, "my_node")

In [56]:
group1.show_tree()

mygroup
├── my_node *
└── subgroup



Note that you can still add nodes that are unnamed

In [57]:
group1.get_group().add_nodes(Data().store())

A *leaf* suffixed with `*` is a node.

In [58]:
group1.show_tree()

mygroup
├── my_node *
└── subgroup



But this can be customized as well

In [59]:
group1.show_tree(decorate_node, decorate_with_uuid)

mygroup
├── my_node * | 7337d4d8-589
└── subgroup



The extra `Data` node is not shown since it does not have an *alias*. We can still get it with the convenience method below

In [60]:
group1.list_nodes_without_alias()

[<Data: uuid: 005016fa-ce21-47e8-b6a6-674cae3a52c5 (pk: 3)>]

The *alias* is just stored in the `extras` attribute

In [61]:
group1.list_nodes_without_alias()[0].extras

{'_aiida_hash': 'b258839769b0bf8bda84bf09bcdd0a39d5df30402925e0e0e0d42a62708bf975'}

In [62]:
group1['my_node'].get_node().extras

{'_aiida_hash': 'f29c067972a92842ec9f252cc103bcbc08dafaf462277c4dbf02d0f69b28cf64',
 'ha': 42,
 '_group_alias': {'538a6b0c-96db-4f9b-b249-d651ac8bb87f': 'my_node'}}
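
Judging from the output above, the alias lives under the `_group_alias` key of the extras, in a dictionary whose key appears to be the UUID of the group (an assumption based on this output, not on the plugin's documentation). The plain-Python sketch below mimics that layout with ordinary dictionaries to show how a per-group alias lookup could work. The helper functions are hypothetical and are not part of `aiida-grouppathx`.

```python
GROUP_UUID = "538a6b0c-96db-4f9b-b249-d651ac8bb87f"  # key seen in the output above

# Dictionaries standing in for nodes; only the extras matter here.
toy_nodes = [
    {"pk": 1, "extras": {"ha": 42, "_group_alias": {GROUP_UUID: "my_node"}}},
    {"pk": 3, "extras": {}},  # node added to the group without an alias
]


def alias_in_group(node, group_uuid):
    """Return the alias of `node` for the given group, or None if it has none."""
    return node["extras"].get("_group_alias", {}).get(group_uuid)


def find_by_alias(nodes, group_uuid, alias):
    """Return the nodes that carry `alias` for the given group."""
    return [node for node in nodes if alias_in_group(node, group_uuid) == alias]


def without_alias(nodes, group_uuid):
    """Return the nodes that have no alias for the given group."""
    return [node for node in nodes if alias_in_group(node, group_uuid) is None]


print([node["pk"] for node in find_by_alias(toy_nodes, GROUP_UUID, "my_node")])
print([node["pk"] for node in without_alias(toy_nodes, GROUP_UUID)])
```

Keying the alias by group is what lets the same node carry a different name in each group it belongs to.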


Let's do a simple calculation, `1 + 2 = 3`.  
Store two inputs, `Int(1)` and `Int(2)` under `mygroup`, and the results in `subgroup`.

In [63]:
from aiida.engine import calcfunction
from aiida.orm import Int


@calcfunction
def add(a, b):
    """Add two `Int` nodes; the result is returned as a new `Int` node."""
    return a + b

In [64]:
group1.add_node(Int(1).store(), "my_one")
group1.add_node(Int(2).store(), "my_two")
group2.add_node(add(group1['my_one'].get_node(), group1['my_two'].get_node()), "my_result")

In [65]:
group1.show_tree()

mygroup
├── my_node *
├── my_one *
├── my_two *
└── subgroup
    └── my_result *



We can also decorate the nodes with other properties, such as their value

In [66]:
def decorate_by_value(path):
    """Return a text decoration showing the value of `Int` nodes, or None otherwise."""
    if path.is_node:
        node = path.get_node()
        if isinstance(node, Int):
            return "Int: " + str(node.value)
    return None

In [67]:
group1.show_tree(decorate_node, decorate_with_uuid, decorate_by_value)

mygroup
├── my_node * | 7337d4d8-589
├── my_one * | 2ec62214-2bc | Int: 1
├── my_two * | 2066f714-02a | Int: 2
└── subgroup
    └── my_result * | 61f9e4c3-569 | Int: 3



One can put the calculation node into the group as well

In [68]:
group1.add_node(group2['my_result'].get_node().get_incoming().one().node, "calculation")

In [69]:
group1.show_tree(decorate_node, decorate_with_exit_status)

mygroup
├── calculation * | [0]
├── my_node *
├── my_one *
├── my_two *
└── subgroup
    └── my_result *



But this is not necessary, since the node at `group2['my_result']` already carries the link pointing back to the calculation, which can be used to trace the inputs. AiiDA records this link automatically.

```
group2['my_result'].get_node().get_incoming().one().node
```

However, for actual DFT calculations, one calls the `submit` function and gets a `ProcessNode` back. The `ProcessNode` can be added to a group to allow monitoring and extracting data later on.

For example:

In [70]:
calc = group1['calculation'].get_node()
calc.outputs.result

<Int: uuid: 61f9e4c3-5696-45d2-8b9b-20d199994f92 (pk: 7) value: 3>

## Remarks 

While the use of `Node` for storing everything can seem overwhelming, AiiDA provides an array of tools for organizing data.  


With the help of `GroupPathX` ([aiida-grouppathx](https://github.com/zhubonan/aiida-grouppathx/)), a plugin that does not modify aiida-core's code in any way, one can work with and organize data in a way that is similar to the files and folders approach.