# User-defined metadata

In addition to system metadata, iRODS allows you to attach your own metadata to data objects and collections.

You can use this metadata to describe your data and to search for it later. It can also help you keep track of which data was the input of an analysis and which is the outcome.

<img src="img/DataObject5.png" width="400">

Technically, iRODS offers metadata as key-value-units triples. Let's investigate this:

## Add metadata to data objects

As always, we first have to create an iRODS session:

In [None]:
from ibridges.interactive import interactive_auth
session = interactive_auth()

Make sure we have our *demo* collection and file available:

In [None]:
from ibridges.path import IrodsPath

irods_path = IrodsPath(session, '~')
print("Current working location:", irods_path)
irods_coll_path = irods_path.joinpath('demo')
print("New collection name:", irods_coll_path)
coll = IrodsPath.create_collection(session, irods_coll_path)
print("New collection is created:", irods_coll_path.collection_exists())

Now we can retrieve a data object and inspect its metadata.

In [None]:
from ibridges.path import IrodsPath

irods_coll_path = IrodsPath(session, '~').joinpath('demo')
obj = irods_coll_path.joinpath('demofile.txt')
print(obj.meta)

Most probably you will see no metadata in the above cell. **Note that system metadata and user-defined metadata are two different entities in a data object!**
With `obj.meta` we only retrieve the user-defined metadata.

<img src="img/DataObject4.png" width="400">

Now we can add some metadata of our own. Each metadata entry is a key-value-units triple:

In [None]:
obj.meta.add('Key', 'Value', 'Units')
print(obj.meta)

Sometimes we do not really have `units`, so we can leave this part empty:

In [None]:
obj.meta.add('Author', 'Christine')
print(obj.meta)

We can also add a second author:

In [None]:
obj.meta.add('Author', 'Raoul')
print(obj.meta)

As you can see, **a key in iRODS metadata can have several values**. This is different from Python dictionaries, where one key can only have one value. **How, then, do we overwrite a value?**
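The contrast with a Python dictionary can be sketched without an iRODS connection. The snippet below is only an illustrative model: it stores metadata as a plain list of `(key, value, units)` tuples, which is an assumption made for this sketch and not a description of how iBridges or iRODS store metadata internally.

```python
# Illustrative model only; no iRODS connection is used here.

# A Python dictionary keeps one value per key, so the second assignment wins.
as_dict = {}
as_dict["Author"] = "Christine"
as_dict["Author"] = "Raoul"
print(as_dict)

# Modelling metadata as a list of (key, value, units) triples keeps both entries.
as_triples = []
as_triples.append(("Author", "Christine", None))
as_triples.append(("Author", "Raoul", None))
print(as_triples)

# Collect all values stored under one key.
authors = [value for key, value, _units in as_triples if key == "Author"]
print(authors)
```

The dictionary ends up with a single author, while the list of triples keeps both.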

## Overwrite metadata

If you wish to *overwrite* a value, you first have to remove the old metadata entry and then add a new one. **Note that if you only give the key, all entries with that key will be deleted.** If you want to be more specific, you also need to give the value and, where needed, the units.

In [None]:
obj.meta.delete('Author')
print(obj.meta)

In [None]:
obj.meta.add('Author', 'Raoul')
obj.meta.add('Author', 'Christine')
print(obj.meta)

In [None]:
obj.meta.delete('Author', 'Christine')
print(obj.meta)

You can also replace all existing values of a key with **one** new value:

In [None]:
obj.meta.set('Author', 'Maarten')
print(obj.meta)

iRODS metadata also has an entry called `units`. The same principles that we showed above apply to units: the same key-value pair can exist with several units, and entries can be deleted and set.

In [None]:
obj.meta.add('key', 'value', 'units1')
obj.meta.add('key', 'value', 'units2')
print(obj.meta)

In [None]:
obj.meta.set('key', 'value', 'units3')
print(obj.meta)

In [None]:
obj.meta.delete('key', 'value')
print(obj.meta)
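The delete and set behaviour shown in this section can be summarised in a small in-memory model. This is a sketch under assumptions: `MetaModel` is a hypothetical stand-in class written for illustration, not the iBridges implementation. It treats an omitted value or units as "match anything", which is how the cells above read, and it silently ignores duplicate entries, which is a choice of this model only.

```python
class MetaModel:
    """Hypothetical in-memory stand-in for the metadata of one item."""

    def __init__(self):
        self.entries = []  # list of (key, value, units) tuples

    def add(self, key, value, units=None):
        entry = (key, value, units)
        if entry not in self.entries:  # model choice: ignore exact duplicates
            self.entries.append(entry)

    def delete(self, key, value=None, units=None):
        # None means "do not filter on this field".
        def matches(entry):
            k, v, u = entry
            return (k == key
                    and (value is None or v == value)
                    and (units is None or u == units))

        self.entries = [e for e in self.entries if not matches(e)]

    def set(self, key, value, units=None):
        # Drop every entry stored under the key, then add the new one.
        self.delete(key)
        self.add(key, value, units)


meta = MetaModel()
meta.add("Author", "Raoul")
meta.add("Author", "Christine")
meta.delete("Author", "Christine")  # removes only the matching value
print(meta.entries)

meta.add("Author", "Christine")
meta.set("Author", "Maarten")  # replaces both authors with one entry
print(meta.entries)
```

In this model, `set` is simply a delete by key followed by an add, which mirrors the two-step overwrite described at the start of this section.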

## Add metadata to collections

We can use the same functionality for collections:

In [None]:
coll = irods_coll_path
print(coll.meta)

In [None]:
coll.meta.add('TypeOfCollection', 'Results')
print(coll.meta)

## Which metadata can help you keep an overview?

iRODS metadata can help you keep an overview while you are working with data, especially with many files that are related to each other. There are ontologies which define keywords and links between keywords, like the **[PROV-O ontology](https://www.w3.org/TR/prov-o/#prov-o-at-a-glance)**.

Let's see how we can annotate our test data, so that we know that it is test data.

In [None]:
from datetime import datetime
coll.meta.add('prov:wasGeneratedBy', 'Christine')
coll.meta.add('CollectionType', 'testcollection')
obj.meta.add('prov:SoftwareAgent', 'iRODS jupyter Tutorial')
obj.meta.add('prov:wasGeneratedBy', 'Maarten')
obj.meta.add('DataType', 'testdata')

Now we have some more descriptive metadata that gives us hints about the context in which the data was created:

In [None]:
print(coll.meta)
print()
print(obj.meta)
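When many items need the same set of annotations, the `add` calls can be grouped in a small helper. The sketch below assumes only that an item exposes `meta.add(key, value, units)` as used in the cells above. The helper `annotate` and the stand-in classes `FakeMeta` and `FakeItem` are hypothetical names introduced so that the example runs without an iRODS connection.

```python
def annotate(item, entries):
    """Add every (key, value) or (key, value, units) tuple in entries to item.meta."""
    for entry in entries:
        item.meta.add(*entry)


# Hypothetical stand-ins so the sketch runs offline; with iBridges you would
# pass an IrodsPath such as `coll` or `obj` instead.
class FakeMeta:
    def __init__(self):
        self.entries = []

    def add(self, key, value, units=None):
        self.entries.append((key, value, units))


class FakeItem:
    def __init__(self):
        self.meta = FakeMeta()


test_annotations = [
    ("prov:wasGeneratedBy", "Maarten"),
    ("DataType", "testdata"),
]
item = FakeItem()
annotate(item, test_annotations)
print(item.meta.entries)
```

Keeping the annotations in one list makes it easier to apply the same vocabulary consistently across a project.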

## Finding data by their metadata

Metadata not only helps you keep an overview of your data, but can also be used to select and retrieve data. In iBridges you can use the user-defined metadata and some system metadata fields to search for data.

In our first example, we are looking for objects and collections called *demo* in our `home`:

In [None]:
from ibridges.search import search_data, MetaSearch
result = search_data(session, path=session.home, path_pattern="demo")
print(result)

The output is a list of `IrodsPath` objects indicating the locations of the data objects and collections.
If no `path` is provided, *ibridges* will automatically fall back on your `home`.

In [None]:
result = search_data(session, metadata=MetaSearch(key='prov:wasGeneratedBy', value='Christine'))
print(result)

If we do not want to specify the particular value for this metadata entry, we can leave it out.

In [None]:
result = search_data(session, metadata=MetaSearch(key='prov:wasGeneratedBy'))
print(result)

Now we also receive the data object that was generated by *Maarten*.

And of course we can combine information about the path and the metadata. The criteria will be connected with `and`. The following search will retrieve all data objects and collections which are labeled with the metadata key *'prov:wasGeneratedBy'* and whose path has the prefix */nluu12p/home/research-test-christine/demo/*.

In [None]:
result = search_data(session, path=IrodsPath(session, session.home, 'demo'),
                     metadata=MetaSearch(key='prov:wasGeneratedBy'))
print(result)

## Searches using wildcards

Sometimes we are not sure about the exact pattern that we are searching for, be it metadata keys, values and units, or path patterns. iRODS knows the `%` sign as a wildcard.
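To get a feeling for what `%` matches before running a search, the wildcard can be imitated locally. The function `like_match` below is a hypothetical helper written for this sketch. It only models `%` as "any run of characters, including none" and compares strings exactly as given; other matching rules, such as case sensitivity, are decided by the server and iBridges and are not reproduced here.

```python
import re


def like_match(pattern: str, text: str) -> bool:
    """Return True if text matches pattern, where % stands for any run of characters."""
    # Escape every literal part and join the parts with the regex ".*".
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


keys = ["prov:wasGeneratedBy", "prov:SoftwareAgent", "DataType"]
print([k for k in keys if like_match("prov:%", k)])

print(like_match("%demo", "demo/demo"))  # pattern ends with demo
print(like_match("demo%", "demo1"))      # pattern starts with demo
print(like_match("demo", "demo1"))       # no wildcard: exact match only
```

Note that in this model `%` also matches the `/` separator, so a single `%` can span several levels of a path.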

### Wildcards in metadata

Assume we know that some data was annotated according to the PROV-O ontology, abbreviated as `prov`, but we do not know which terms of that ontology were used. In such a case we can find all data annotated with a key that has the prefix `prov:` like this:

In [None]:
result = search_data(session, path=IrodsPath(session, session.home),
                     metadata=MetaSearch(key='prov:%'))
print(result)

### Wildcards in path patterns

Let us go back to the very first example of this section. We are looking for collections and data objects called `demo`, and they need to lie directly in our `home`:

In [None]:
result = search_data(session, path=session.home, path_pattern="demo")
print(result)

How can we retrieve all `demo` collections and objects, even if they are subcollections or lie in subcollections? Let's first create subcollections in `demo` called `demo` and `demo1`.

In [None]:
irods_path = IrodsPath(session, "demo", "demo")
print(irods_path)
IrodsPath.create_collection(session, irods_path)
irods_path = IrodsPath(session, "demo", "demo1")
print(irods_path)
IrodsPath.create_collection(session, irods_path)

Now let's see how to use the wildcard to find those two collections.

#### 1. Find all data and collections ending with `demo`

In [None]:
result = search_data(session, path=session.home, path_pattern="%demo")
print('\n'.join([str(p) for p in result]))

#### 2. Find all data and collections starting with `demo`

In [None]:
result = search_data(session, path=session.home, path_pattern="demo%")
print('\n'.join([str(p) for p in result]))

#### 3. Find all collections and data called `demo` that lie on the 5th layer of the collection tree

In [None]:
result = search_data(session, path=session.home, path_pattern="%/%/%/%/%/demo")
print('\n'.join([str(p) for p in result]))

#### 4. Find all `txt` files that lie on a collection path that contains `demo`

For this case we have to think of a pattern for the collection path and the object name and separate both with `/`:

In [None]:
coll_pattern = "%demo%"
obj_pattern = "%.txt"
print(f"Search pattern: {coll_pattern+'/'+obj_pattern}")
result = search_data(session, path=session.home, path_pattern=coll_pattern+"/"+obj_pattern)
print('\n'.join([str(p) for p in result]))

## Retrieving data

Now that we have the search results we can use the `IrodsPath` to download them or to fetch more information:

In [None]:
print(result[0].size)
print(result[0].collection_exists())
print(result[0].dataobject_exists())

# Metadata archives

In most cases the user is encouraged to access and manipulate metadata through the `MetaData` class. However, there are some cases where it can be useful to create an archive of all metadata in a collection and all subcollections and data objects. One example might be a backup of the data and metadata on a system that does not support metadata. Another might be to easily transfer metadata from one iRODS system to another. A final use case might be having access to the metadata during computation on a system that is not connected to the internet.

## Creating a metadata archive

In [None]:
from ibridges.data_operations import create_meta_archive

collection_path = IrodsPath(session, "demo")
create_meta_archive(session, collection_path, "meta_archive.json")

This creates a file `meta_archive.json` in the current local directory of this Jupyter notebook. It contains all metadata of all subcollections and data objects in the collection `demo`.

In [None]:
!cat meta_archive.json
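`cat` is a shell command and may not be available on every system. As an alternative, the archive can be read with Python's standard `json` module. The sketch below makes no assumption about the internal layout of the archive; it only assumes that the file is valid JSON, as the `.json` extension suggests. `show_json` is a hypothetical helper name.

```python
import json
from pathlib import Path


def show_json(path):
    """Load a JSON file and return its content as an indented string."""
    with open(path, encoding="utf-8") as handle:
        content = json.load(handle)
    return json.dumps(content, indent=2, ensure_ascii=False)


archive = Path("meta_archive.json")
if archive.exists():  # only print when the archive has been created
    print(show_json(archive))
```

The indented output is usually easier to read than the raw file, especially for collections with many items.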

## Applying a metadata archive

This restores/overwrites the metadata on the iRODS server with the metadata from the archive. Make sure that the paths of the subcollections and data objects have not changed.

In [None]:
from ibridges.data_operations import apply_meta_archive

apply_meta_archive(session, "meta_archive.json", collection_path)