# User-defined metadata

In addition to system metadata, iRODS allows you to attach your own metadata to data objects and collections.

You can use this metadata to describe your data and to search for it later. It can also help you keep track of what the input for an analysis was and what the outcome is.

<img src="img/DataObject5.png" width="400">

Technically, iRODS stores metadata as key-value-units triples. Let's investigate this.

As always, we first have to create an iRODS session:

In [None]:
from ibridges.interactive import interactive_auth
import warnings
warnings.filterwarnings('ignore')

session = interactive_auth()

## Add metadata to an `IrodsPath`

Make sure we have our *demo* collection and object available:

In [None]:
from ibridges.path import IrodsPath

irods_path = IrodsPath(session, '~')
print("Current working location:", irods_path)
irods_coll_path = irods_path.joinpath('demo')
irods_obj_path = irods_coll_path / 'demofile.txt'
print("Demo collection name:", irods_coll_path, "exists: ", irods_coll_path.collection_exists())
print("Demo object name", irods_obj_path, "exists: ", irods_obj_path.dataobject_exists())

We can retrieve the metadata associated with the data object from its `IrodsPath`. For convenience, we store it in the variable `obj_meta`. Note that `obj_meta` is no longer an `IrodsPath` but an object of type `MetaData`:

In [None]:
print(irods_obj_path.meta)
obj_meta = irods_obj_path.meta
print(type(obj_meta))

Most probably you will see no metadata in the output of the above cell. 

**Note that system metadata and user-defined metadata are two different entities in a data object!**

With the command `IrodsPath.meta` we only retrieve the user-defined metadata.

<img src="img/DataObject4.png" width="400">

Now we can add some metadata of our own. Each metadata item is a key-value-units triple:

In [None]:
obj_meta.add('Key', 'Value', 'Units')
print(obj_meta)

Sometimes we do not really have `units`, so we can leave this part empty:

In [None]:
obj_meta.add('Author', 'Christine')
print(obj_meta)

We can also add a second author:

In [None]:
obj_meta.add('Author', 'Alice')
print(obj_meta)

As you can see, **in iRODS a metadata key can have several values**. That is different from Python dictionaries, where a key can have only one value. **How, then, do we overwrite a value?**
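To make the contrast with a dictionary concrete, here is a minimal sketch in plain Python. It does not use iBridges. It models metadata as a set of `(key, value, units)` tuples, which is an assumption made purely for illustration and says nothing about how iRODS stores metadata internally.

```python
# A Python dict keeps one value per key: the second assignment replaces the first.
as_dict = {}
as_dict["Author"] = "Christine"
as_dict["Author"] = "Alice"

# Illustrative model only: metadata as a set of (key, value, units) triples.
# Missing units are modelled here as None.
as_triples = set()
as_triples.add(("Author", "Christine", None))
as_triples.add(("Author", "Alice", None))

# Collect every value stored under the key "Author".
authors = sorted(value for key, value, units in as_triples if key == "Author")

print(as_dict)
print(authors)
```

In the dictionary only one author survives, while the set of triples keeps both.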

## Overwrite metadata

If you wish to *overwrite* a key, value or units, you first have to retrieve the respective metadata item. You can retrieve an item by providing its key. If several items share the same key, you also have to provide the value and sometimes the units.

The syntax looks like accessing a dictionary. Let's have a look at how to retrieve the author metadata:

In [None]:
obj_meta["Author"]

*iBridges* complains that there are several metadata items with the key `Author`. Let's have a look at all of those:

In [None]:
print(obj_meta.find_all('Author'))

Now we can retrieve the one where the author is `Christine`:

In [None]:
meta_item = obj_meta['Author', 'Christine']
print(meta_item)
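The lookup rule used above, where a key alone is enough only as long as it is unambiguous, can be modelled in a few lines of plain Python. The function `find_one` below is hypothetical and not part of iBridges. It only mimics the idea on a list of tuples.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, Optional[str]]


def find_one(items: List[Triple], key: str,
             value: Optional[str] = None,
             units: Optional[str] = None) -> Triple:
    """Return the single triple matching the given fields (hypothetical helper)."""
    matches = [
        item for item in items
        if item[0] == key
        and (value is None or item[1] == value)
        and (units is None or item[2] == units)
    ]
    if not matches:
        raise KeyError(f"no item matches key={key!r}")
    if len(matches) > 1:
        raise ValueError(f"{len(matches)} items match key={key!r}; add value or units")
    return matches[0]


items = [
    ("Key", "Value", "Units"),
    ("Author", "Christine", None),
    ("Author", "Alice", None),
]

print(find_one(items, "Key"))                  # the key alone is unambiguous
print(find_one(items, "Author", "Christine"))  # the value disambiguates
try:
    find_one(items, "Author")                  # two items share this key
except ValueError as err:
    print("Ambiguous:", err)
```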

And we can change the value of exactly that metadata item:

In [None]:
print(meta_item)
meta_item.value = "AnotherAuthor"
print(meta_item)

**Important**: What happens if we change the metadata item to one that is already present in the metadata of the object? Changing `AnotherAuthor` to `Alice` would create a metadata item identical to one that already exists for that object. Let's try it out:

In [None]:
meta_item.value = 'Alice'

Of course you can also alter the `key` and the `units` of a metadata item:

In [None]:
print("Changing: ", meta_item)
meta_item.key = 'Key'
print("Overwriting the key:", meta_item)
meta_item.units = 'MyUnits'
print("Overwriting the units:", meta_item)

### Setting metadata

Another way to set a metadata key to a new value and units is with the bracket `[]` notation.

In [None]:
print(obj_meta)

In [None]:
obj_meta['Author'] = 'person'
print(obj_meta)

**Note** that if there are several entries with the same key, the following will fail:

In [None]:
obj_meta['Key'] = 'OtherValue'
print(obj_meta)

If you would like to replace all metadata items that share a key with one new item, use:

In [None]:
obj_meta[['Key']] = [['OtherValue']]

In [None]:
print(obj_meta)

## Deleting metadata

In [None]:
obj_meta.add('Author', 'Christine')
print(obj_meta)

### Deleting a single metadata item

To delete a single metadata item, you again have to be specific about the key, value and units so that the correct item is identified. To delete all metadata with the key `Key`, we can simply use:

In [None]:
obj_meta.delete('Key')
print(obj_meta)

The same command on the metadata with the key `Author` would delete all of the entries:

In [None]:
obj_meta.delete('Author')
print(obj_meta)

If you want to clear the whole metadata, use:

In [None]:
obj_meta.clear()
print(obj_meta)

## Which metadata can help you keep an overview?

iRODS metadata can help you keep an overview while you are working with data and many files that are related to each other. There are ontologies which define keywords and the links between them, such as the **[PROV-O ontology](https://www.w3.org/TR/prov-o/#prov-o-at-a-glance)**.

Let's see how we can annotate our test data, so that we know that it is test data.

In [None]:
from datetime import datetime
coll_meta = irods_coll_path.meta
coll_meta.add('prov:wasGeneratedBy', 'Christine')
coll_meta.add('CollectionType', 'testcollection')
obj_meta.add('prov:SoftwareAgent', 'iRODS jupyter Tutorial')
obj_meta.add('prov:wasGeneratedBy', 'Maarten')
obj_meta.add('DataType', 'testdata')

Now we have some more descriptive metadata that gives us hints about the context in which the data was created:

In [None]:
print(coll_meta)
print()
print(obj_meta)
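If you annotate many objects in the same way, it can help to build the key-value pairs in one place first. The sketch below is plain Python. The helper name `provenance_pairs` and its parameters are made up for this example, and the keys are simply the ones used in the cells above. The resulting pairs could then be passed one by one to `add`.

```python
from typing import List, Optional, Tuple


def provenance_pairs(generated_by: str,
                     software: Optional[str] = None,
                     data_type: Optional[str] = None) -> List[Tuple[str, str]]:
    """Collect (key, value) pairs for annotation (hypothetical helper)."""
    pairs = [("prov:wasGeneratedBy", generated_by)]
    if software is not None:
        pairs.append(("prov:SoftwareAgent", software))
    if data_type is not None:
        pairs.append(("DataType", data_type))
    return pairs


pairs = provenance_pairs("Maarten",
                         software="iRODS jupyter Tutorial",
                         data_type="testdata")
for key, value in pairs:
    print(key, "=", value)
    # In the notebook you could then call: obj_meta.add(key, value)
```

Keeping the keys in one function makes it less likely that a typo creates a second, slightly different key.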

## Finding data by their metadata

Metadata not only helps you to keep an overview of your data, it can also be used to select and retrieve data. In iBridges you can use the user-defined metadata and some system metadata fields to search for data.

In our first example, we are looking for objects and collections called *demo* in our `home`:

In [None]:
from ibridges.search import search_data, MetaSearch
result = search_data(session, path=session.home, path_pattern="demo")
print(result)

The output is a list of `CachedIrodsPath` objects indicating the locations of the data objects and collections.
If the parameter `path` is not provided, *iBridges* will automatically fall back on your `home`.

We can also search by metadata, here for everything that was generated by *Christine*:

In [None]:
result = search_data(session, metadata=MetaSearch(key='prov:wasGeneratedBy', value='Christine'))
print(result)

If we do not want to specify the particular value for this metadata entry, we can leave it out.

In [None]:
result = search_data(session, metadata=MetaSearch(key='prov:wasGeneratedBy'))
print(result)

Now we also receive the data object that was generated by *Maarten*.

And of course we can combine information about the path and the metadata; the two criteria are connected with `and`. The following search will retrieve all data objects and collections which are labeled with the metadata key *'prov:wasGeneratedBy'* and whose path has the prefix */nluu12p/home/research-test-christine/demo/*.

In [None]:
result = search_data(session, path=IrodsPath(session, session.home, 'demo'),
                     metadata=MetaSearch(key='prov:wasGeneratedBy'))
print(result)

## Searches using wildcards

Sometimes we are not sure about the exact pattern that we are searching for, be it metadata keys, values and units or path patterns. iRODS knows the `%` sign as a wildcard.

### Wildcards in metadata

Assume we know that some data was annotated according to the PROV-O ontology, abbreviated as `prov`, but we do not know which terms of that ontology were used. In such a case we can find all data annotated with a key that has the prefix `prov:` like this:

In [None]:
result = search_data(session, path=IrodsPath(session, session.home),
                     metadata=MetaSearch(key='prov:%'))
print(result)
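To get a feeling for what a `%` pattern matches, you can try it locally. The sketch below is plain Python and independent of iRODS. The helper `like_match` is hypothetical and handles only `%`, read as any run of characters (including none), with every other character taken literally.

```python
import re


def like_match(pattern: str, text: str) -> bool:
    """Check text against a pattern in which % stands for any run of characters."""
    # Escape each literal piece, then join the pieces with the regex equivalent of %.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


keys = ["prov:wasGeneratedBy", "prov:SoftwareAgent", "DataType", "Author"]
prov_keys = [key for key in keys if like_match("prov:%", key)]
print(prov_keys)
```

The same helper can be used to preview the prefix and suffix patterns used for paths, for example `like_match("demo%", "demo1")`.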

### Wildcards in path patterns

Let us go back to the very first example of this section: we are looking for collections and data objects called `demo` that lie directly in our `home`:

In [None]:
result = search_data(session, path=session.home, path_pattern="demo")
print(result)

How can we retrieve all `demo` collections and objects, even if they lie in subcollections? Let's first create subcollections in `demo` called `demo` and `demo1`.

In [None]:
# Create the subcollection demo/demo
irods_path = IrodsPath(session, "demo", "demo")
print(irods_path)
IrodsPath.create_collection(session, irods_path)
# Create the subcollection demo/demo1
irods_path = IrodsPath(session, "demo", "demo1")
print(irods_path)
IrodsPath.create_collection(session, irods_path)

Now let's see how to use the wildcard to find those two collections.

#### 1. Find all data and collections ending with `demo`

In [None]:
result = search_data(session, path=session.home, path_pattern="%demo")
print('\n'.join([str(p) for p in result]))

#### 2. Find all data and collections starting with `demo`

In [None]:
result = search_data(session, path=session.home, path_pattern="demo%")
print('\n'.join([str(p) for p in result]))

#### 3. Find all collections and data called `demo` on the 5th layer of the collection tree

In [None]:
result = search_data(session, path=session.home, path_pattern="%/%/%/%/%/demo")
print('\n'.join([str(p) for p in result]))

#### 4. Find all `txt` files that lie on a collection path that contains `demo`

For this case we have to think of a pattern for the collection path and the object name and separate both with `/`:

In [None]:
coll_pattern = "%demo%"
obj_pattern = "%.txt"
print(f"Search pattern: {coll_pattern+'/'+obj_pattern}")
result = search_data(session, path=session.home, path_pattern=coll_pattern+"/"+obj_pattern)
print('\n'.join([str(p) for p in result]))

## Retrieving data

Now that we have the search results we can use the `CachedIrodsPath` to download them or to fetch more information.

**Note that a `CachedIrodsPath` contains information, e.g. checksum and size, as it was at the time of the search.**

In [None]:
print(type(result[0]))
print(result[0].size)
print(result[0].checksum)
print(result[0].collection_exists())
print(result[0].dataobject_exists())

In case you need to be sure about the current size or checksum, you will have to cast the path again to an `IrodsPath`.

In [None]:
ipath = IrodsPath(session, result[0])
type(ipath)
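Why a value recorded at search time can become outdated is easy to demonstrate without iRODS. The sketch below uses a local temporary file and a hypothetical `Snapshot` class, neither of which is part of iBridges. The snapshot keeps the size it saw when it was created, while a fresh lookup reflects later changes.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a cached path: stores the size seen at creation."""
    path: str
    size: int

    @classmethod
    def take(cls, path: str) -> "Snapshot":
        return cls(path=path, size=os.path.getsize(path))


with tempfile.TemporaryDirectory() as tmp:
    local_file = os.path.join(tmp, "demofile.txt")
    with open(local_file, "w") as handle:
        handle.write("hello")

    cached = Snapshot.take(local_file)      # the size is recorded now

    with open(local_file, "a") as handle:   # the file changes afterwards
        handle.write(" world")

    cached_size = cached.size               # still the old value
    live_size = os.path.getsize(local_file) # looked up again

print("cached:", cached_size, "live:", live_size)
```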

# Metadata archives

In most cases the user is encouraged to access and manipulate metadata through the `MetaData` class. However, there are some cases where it can be useful to create an archive of all metadata in a collection and all subcollections and data objects. One example might be a backup of the data and metadata on a system that does not support metadata. Another might be to easily transfer metadata from one iRODS system to another. A final use case might be having access to the metadata during computation on a system that is not connected to the internet.

## Creating a metadata archive

In [None]:
collection_path = IrodsPath(session, "demo")
collection_path.create_meta_archive("meta_archive.json")

This creates a file `meta_archive.json` in the current local directory of this Jupyter notebook. The file contains the metadata of the collection `demo` and of all its subcollections and data objects.

In [None]:
!cat meta_archive.json
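Because the archive is a JSON file, it can also be inspected with Python's standard library, for example on a machine without access to iRODS. The layout of the archive is not described here, so the sketch below makes no assumption about it beyond being valid JSON. The hypothetical helper `walk` simply lists every leaf value together with its location. The sample content written to the temporary file is invented for the demonstration and does not reflect the real archive format.

```python
import json
import os
import tempfile
from typing import Any, Iterator, Tuple


def walk(node: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (location, value) for every leaf in nested JSON data."""
    if isinstance(node, dict):
        for name, child in node.items():
            yield from walk(child, f"{prefix}/{name}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, f"{prefix}[{index}]")
    else:
        yield prefix, node


# Invented sample data, only to make the sketch runnable on its own.
sample = {"items": [{"name": "demofile.txt",
                     "metadata": [["Author", "Christine", ""]]}]}

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "sample_archive.json")
    with open(archive, "w", encoding="utf-8") as handle:
        json.dump(sample, handle)
    with open(archive, encoding="utf-8") as handle:
        leaves = list(walk(json.load(handle)))

for location, value in leaves:
    print(location, "->", value)
```

To look at the real archive, you would open `meta_archive.json` instead of the temporary sample file.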

## Applying a metadata archive

This restores/overwrites the metadata on the iRODS server with the metadata from the archive. Make sure that the paths of the subcollections and data objects have not changed.

In [None]:
collection_path.apply_meta_archive("meta_archive.json")