
<div style="
    background-color: #f7f7f7;
    background-image: url(''), url('') ;
    background-position: left bottom, right top;
    background-repeat: no-repeat,  no-repeat;
    background-size: auto 60px, auto 160px;
    border-radius: 5px;
    box-shadow: 0px 3px 1px -2px rgba(0, 0, 0, 0.2), 0px 2px 2px 0px rgba(0, 0, 0, 0.14), 0px 1px 5px 0px rgba(0,0,0,.12);">

<h1 style="
    color: #2a4cdf;
    font-style: normal;
    font-size: 2.25rem;
    line-height: 1.4em;
    font-weight: 600;
    padding: 30px 200px 0px 30px;"> 
        NOMAD as a Data Management Framework Tutorial</h1>

<p style="
    line-height: 1.4em;
    padding: 30px 200px 0px 30px;">
        This tutorial notebook demonstrates how to use NOMAD
        for managing custom data and file types. Based on a simple <i>Countries of the World</i>
        dataset, it shows how to model the data in a schema, do parsing and normalization,
        process data, access existing data with NOMAD's API for analysis, and how to
        add visualization to your data entries.
</p>

<p style="font-size: 1.25em; font-style: italic; padding: 5px 200px 30px 30px;">
    Markus Scheidgen, José A. Márquez</p>
</div>

In [None]:
# This is necessary in some development environments. You can ignore this!
from nomad.config import client
client.url = client.url.replace('://localhost', '://host.docker.internal')
# A utility to show structured data in cell outputs.
from IPython.display import JSON

# Content

- [Data](#Data)
- [Schema](#Schema)
- [Normalizing](#Normalizing)
- [Parsing](#Parsing)
- [Plugins](#Plugins)
- [Processing](#Processing)
- [Analysis](#Analysis)
- [Visualization](#Adding-Visualization-to-your-NOMAD-entries)

## How to run this notebook

Ideally, you are here because you created the example upload *NOMAD as a Data Management Framework Tutorial* and started the `Tutorial.ipynb` notebook. *No other preparation is required.*

Alternatively, you can download the [necessary files from GitLab](https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/tree/develop/examples/data/cow_tutorial).
From the downloaded directory, you can run `Tutorial.ipynb` using our JupyterLab Docker image:

```
docker run --rm -p 8888:8888 -v `pwd`:/home/jovyan/work \
    gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-remote-tools-hub/jupyterlab:latest
```

If you want to run this with your own Jupyter installation or any other solution,
you need to install the `nomad-lab` PyPI package, and your OS also needs to have libmagic installed.

```
sudo apt-get install --yes --quiet --no-install-recommends libmagic-dev
pip install nomad-lab
```

<div class="alert alert-block alert-warning">
The cells in this notebook are not independent and have to be run in order.
</div>

<div style="height: 4rem;">&nbsp;</div>

## Data

Here is some example data in a proprietary text file format: [Germany.data.txt](raw_data/Germany.data.txt). The data combines two public Kaggle datasets ([1](https://www.kaggle.com/datasets/fernandol/countries-of-the-world), [2](https://www.kaggle.com/datasets/kaggle/world-development-indicators)); the original data files and a notebook to create the `.data.txt` files can be [downloaded here](https://datashare.mpcdf.mpg.de/s/CKgf3TZ7TtxB2P1). Now imagine we have such a file for each country in the world.

<div style="height: 4rem;">&nbsp;</div>

## Schema

You might also read [How to write a schema package](https://nomad-lab.eu/prod/v1/staging/docs/howto/plugins/schema_packages.html) or [Structured data](https://nomad-lab.eu/prod/v1/staging/docs/explanation/data.html) from the NOMAD documentation.

With a first impression of the data, we can start to design a *schema*. A schema describes a data structure that can be used to instantiate NOMAD entries. The basic building blocks for schemas are:

  - *sections*: Containers for data. Instantiated from the `MSection` class.
  - *quantities*: Concrete data values. Instantiated from the `Quantity` class.
  - *sub-sections*: Used to nest sections within each other. Instantiated from the `SubSection` class.
  - *schemas*: Top-level sections from which NOMAD entries are created. Instantiated from the `Schema` class.
  
Here is an example of a schema definition as a Python class:

In [None]:
from nomad.metainfo import Quantity
from nomad.datamodel import Schema
import numpy as np


class Country(Schema):
    name = Quantity(type=str)
    population = Quantity(type=np.int32)
    area = Quantity(type=np.float64, unit='km^2')

Now we can instantiate this schema with some data. In `Germany.data.txt`, we find something like this:

```
#Country=Germany
#Region=WESTERN EUROPE
#Population=82422299
#Area (sq. mi.)=357021
#Pop. Density (per sq. mi.)=230,9
#Coastline (coast/area ratio)=0,67
```

Let's use some of this information to populate the data. Schemas and other sections can be instantiated like normal Python classes. Quantities are passed as constructor keyword arguments or by assigning values to fields.

In [None]:
from nomad.units import ureg

example = Country(
    name='Germany',
    population=82422299
)
example.area = 357021 * ureg('mi^2')

example.m_to_dict()

Above, we provided only the required technical information. To make the definitions more useful to human users, you can also add documentation in natural language with `description` or link related resources with `links`.

In [None]:
from nomad.metainfo import Section


class Country(Schema):
    ''' This section represents a country of the world. '''
    m_def = Section(links=[
        'https://www.kaggle.com/datasets/fernandol/countries-of-the-world'])

    name = Quantity(
        type=str,
        description='The country\'s name.')
    population = Quantity(
        type=np.int32,
        description='The country\'s population.')
    area = Quantity(
        type=np.float64, unit='km^2',
        description='The area of the country.')

***
Above, we only used scalar quantities (single values). How can we model a time series? Let's use this to also demonstrate *sub-sections*. Say we want to add multiple time series as sub-sections to our more general `Country` schema.

We need to define a new class for a new *section* `Timeseries` and then add `SubSection`s that use `Timeseries` to our `Country` class:

In [None]:
from nomad.metainfo import MSection


class Timeseries(MSection):
    year = Quantity(type=np.int32, shape=['*'])
    value = Quantity(type=np.float64, shape=['*'])

In [None]:
from nomad.metainfo import SubSection


class Country(Schema):
    name = Quantity(type=str)
    population = Quantity(type=np.int32)
    area = Quantity(type=np.float64, unit='km^2')

    gdp = SubSection(
        section=Timeseries,
        description='GDP per capita (constant 2005 US$)'
    )
    birth_rate = SubSection(
        section=Timeseries,
        description='per 1,000 people per year'
    )

***
In order to use such custom schemas in NOMAD, we have to bundle all the definitions in a *schema package*. Such schema packages can be distributed as NOMAD plugins or uploaded as `.archive.json` or `.archive.yaml` files. For our use case, we will convert our Python definitions into the corresponding YAML version and save it directly into the current upload:

In [None]:
from nomad.metainfo import SchemaPackage
from nomad.datamodel import EntryArchive
import yaml

def create_schema_package():
    return EntryArchive(
        definitions=SchemaPackage(
            name='Countries of the World',
            sections=[
                Country.m_def, Timeseries.m_def
            ]
        )
    )

def save_schema_package_to_yaml():
    with open('schema_package.archive.yaml', 'wt') as f:
        f.write(yaml.dump(create_schema_package().m_to_dict(with_out_meta=True), indent=2))

save_schema_package_to_yaml()
print(yaml.dump(create_schema_package().m_to_dict(with_out_meta=True), indent=2))

The full schema in a Python file (as you would have it in a NOMAD plugin) looks like this: [country.py](nomad-countries/src/nomad_countries/schema_packages/country.py). In a `.yaml` file (as you would upload it to NOMAD), it looks like this: [schema_package.archive.yaml](schema_package.archive.yaml).

<div style="height: 4rem;">&nbsp;</div>

## Normalizing

Why not always write the schema in YAML, then?

In Python, we can add `normalize` functions to our schema. These let us add processing steps, for example, to augment our data.

In [None]:
class Country(Schema):
    name = Quantity(type=str)
    population = Quantity(type=np.int32)
    area = Quantity(type=np.float64, unit='km^2')
    population_density = Quantity(type=np.float64, unit='1/km^2')

    gdp = SubSection(
        section=Timeseries,
        description='GDP per capita (constant 2005 US$)'
    )
    birth_rate = SubSection(
        section=Timeseries,
        description='per 1,000 people per year'
    )

    def normalize(self, archive, logger):
        self.population_density = self.population / self.area

save_schema_package_to_yaml()

In [None]:
example = Country(
    name='Germany',
    population=82422299,
    area=(357021 * ureg('mi^2'))
)

example.normalize(None, None)
example.m_to_dict()

This is an extremely simple example, but `normalize` functions can be incredibly powerful, as they allow you to incorporate custom Python code into NOMAD's data processing. This can be used, for example, to fit your data on the fly, to add derived quantities, or even to retrieve data from external APIs.
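
To make the arithmetic behind this `normalize` example concrete, here is a minimal sketch in plain Python, without NOMAD or a unit library. It assumes the international mile (1 mi = 1.609344 km), and the helper names `sq_mi_to_sq_km` and `population_density` are made up for illustration. In the notebook itself, the unit handling is left to `ureg` and the units declared in the schema.

```python
# Plain-Python sketch of the derived quantity computed in `normalize`.
KM_PER_MILE = 1.609344  # international mile, exact by definition


def sq_mi_to_sq_km(area_sq_mi: float) -> float:
    """Convert an area from square miles to square kilometres."""
    return area_sq_mi * KM_PER_MILE ** 2


def population_density(population: int, area_sq_km: float) -> float:
    """Return the number of inhabitants per square kilometre."""
    if area_sq_km <= 0:
        raise ValueError('area must be positive')
    return population / area_sq_km


# Values as used in the example cell above (area given in mi^2).
area_km2 = sq_mi_to_sq_km(357021)
density = population_density(82422299, area_km2)
print(f'{area_km2:.0f} km^2, {density:.1f} inhabitants per km^2')
```

The explicit check for a non-positive area is a design choice of this sketch: a `normalize` function runs on whatever the parser produced, so guarding against missing or zero values avoids a confusing division error during processing.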

<div style="height: 4rem;">&nbsp;</div>

## Parsing

You might also read [From file to data](https://nomad-lab.eu/prod/v1/staging/docs/explanation/basics.html)
or [How to write a parser](https://nomad-lab.eu/prod/v1/staging/docs/howto/plugins/parsers.html) from the NOMAD
documentation. 

We don't want to create schema instances by hand every time. Instead, we want to automate this with a parser that populates the schema with data from a file as soon as a NOMAD installation detects it (for example, after drag-and-drop or an upload via the NOMAD API).

A parser *reads* the contents of a file and *writes* the data in the NOMAD format, based on a schema, into an *archive*. The signature of a `parse` function, for example within a NOMAD plugin, is this:

In [None]:
def parse(mainfile, archive, logger):
    # fill the archive: EntryArchive and return
    pass

We can extract the part of the parsing that deals with the file format into a `read` function. This allows us to have different `read` functions for slightly different file formats, while re-using the part that populates the schema.

In [None]:
import re

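def read(mainfile):
    """Read the leading '#key=value' lines of the file into a dict."""
    data = {}
    with open(mainfile, 'rt') as f:
        while True:
            line = f.readline()
            match = re.match(r'#([^=]+)=(.+)', line)
            if not match:
                # Stop at the first line that is not a '#key=value' pair.
                break
            key, str_value = match.group(1), match.group(2)
            try:
                # Numbers in the file use a decimal comma, e.g. '230,9'.
                value = float(str_value.replace(',', '.'))
            except ValueError:
                value = str_value

            data[key] = value
    return data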

read('raw_data/Germany.data.txt')
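
The same logic can be tried without touching the file system. The sketch below is a variant of `read` that accepts any iterable of lines; the name `read_header` and the in-memory `SAMPLE` text (copied from the excerpt of `Germany.data.txt` shown earlier) are illustrative and not part of the tutorial files.

```python
import io
import re

HEADER_LINE = re.compile(r'#([^=]+)=(.+)')

SAMPLE = """#Country=Germany
#Region=WESTERN EUROPE
#Population=82422299
#Pop. Density (per sq. mi.)=230,9
"""


def read_header(stream):
    """Collect '#key=value' lines until the first line that does not match."""
    data = {}
    for line in stream:
        match = HEADER_LINE.match(line)
        if not match:
            break
        key, raw = match.group(1), match.group(2).strip()
        try:
            # The file uses a decimal comma, e.g. '230,9'.
            data[key] = float(raw.replace(',', '.'))
        except ValueError:
            # Anything that is not a number is kept as text.
            data[key] = raw
    return data


header = read_header(io.StringIO(SAMPLE))
print(header)
```

Note that every numeric value comes back as a `float`, including the population, so any integer conversion has to happen when the schema is populated.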

In the actual `parse` function call, we use the `read` function and populate the given `archive` with the data.

In [None]:
def parse(mainfile, archive, logger):
    data = read(mainfile)
    archive.data = Country(
        name=data['Country'],
        population=data['Population'],
        area=data['Area (sq. mi.)']
    )

To call the `parse` function, we create an empty `EntryArchive` for the `archive` argument. The real NOMAD processing will also provide a `logger` that can be used to report parsing problems.

In [None]:
from nomad.datamodel import EntryMetadata

archive = EntryArchive(metadata=EntryMetadata())
parse('raw_data/Germany.data.txt', archive, None)

JSON(archive.m_to_dict())

After parsing, we can *normalize* the archive to call our `normalize` functions. There is a utility called `normalize_all` in the NOMAD Python package that runs the *normalization* for us.

In [None]:
from nomad.client import normalize_all

normalize_all(archive)

JSON(archive.m_to_dict())

<div style="height: 4rem;">&nbsp;</div>

## Plugins

Above, we showed how to write a simple `parse` function to learn what a parser does. To add such a custom parser to a NOMAD installation, we need to develop a *plugin*. You can read [how to get started with plugins](https://nomad-lab.eu/prod/v1/staging/docs/howto/plugins/plugins.html) to learn more about plugin development, but we have included an example of a plugin in the `nomad-countries` folder in this upload. This more complete example has additional functionality to parse the CSV part and populate the `gdp` and `birth_rate` sub-sections. Feel free to explore the code inside the `nomad-countries` folder to learn more.

We can actually install this complete plugin into this Jupyter environment by installing it with `pip`:

In [None]:
%pip install -e ./nomad-countries

<div class="alert alert-block alert-warning">
<b>Attention:</b> Before the next cells will work, you have to restart the Python kernel. You can do this in "Kernel->Restart Kernel" or by using the "00" hotkey.
</div>

Once the package is installed and you have restarted the kernel, we can use the [parsing programming interface described in the documentation](https://nomad-lab.eu/prod/v1/staging/docs/howto/programmatic/local_parsers.html#from-a-python-program) to run the parser on files:

In [None]:
from nomad.client import parse, normalize_all
from IPython.display import JSON

archive = parse('raw_data/Germany.data.txt')[0]
normalize_all(archive)

JSON(archive.m_to_dict())

Or we use the `nomad` shell command, i.e. the command line interface (CLI):

In [None]:
!PYTHONPATH=. nomad parse raw_data/Germany.data.txt

<div style="height: 4rem;">&nbsp;</div>

## Processing

You might also read [Processing](https://nomad-lab.eu/prod/v1/staging/docs/explanation/processing.html)
from the NOMAD documentation. 

With a NOMAD installation that uses the *Countries of the World* plugin, we would simply upload the `*.data.txt` files, and NOMAD would process them for us by *matching* the files to our parser, *parsing* them, and *normalizing* the results. Finally, NOMAD would persist the results.
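
*Matching* can be pictured as testing each uploaded file against a pattern that a parser declares. The following is a minimal sketch using Python's `re` module. The pattern and the helper name `matching_files` are assumptions made for illustration and are not taken from the `nomad-countries` plugin.

```python
import re

# Hypothetical file-name pattern; the real plugin configuration may differ.
MAINFILE_NAME_RE = re.compile(r'.*\.data\.txt')


def matching_files(paths):
    """Return the paths whose full name matches the pattern."""
    return [path for path in paths if MAINFILE_NAME_RE.fullmatch(path)]


uploaded = [
    'raw_data/Germany.data.txt',
    'raw_data/Poland.data.txt',
    'schema_package.archive.yaml',
    'Tutorial.ipynb',
]
print(matching_files(uploaded))
```

Only the files selected this way would be handed to the parser; all other files in the upload would be left to other parsers or stored as plain raw files.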

Without our own NOMAD, we can still emulate the process and create `*.archive.json` files. We apply the code from before to a few of the countries. 

There is one technicality we have to change to prepare the `.archive.json` files. NOMAD needs to know what schema we are using. Because we are using a Python schema, the exported `.json` will contain a reference to a Python class (`data.m_def=nomad_countries.schema_packages.country.Country`), and we have to change it to a reference to the `.yaml` schema that we created earlier:

In [None]:
import json

for country in ('Germany', 'Poland', 'France'):
    # Process the country file
    archive = parse(f'raw_data/{country}.data.txt')[0]
    normalize_all(archive)
    json_data = archive.m_to_dict()

    # Here we replace the schema reference
    json_data['data']['m_def'] = \
        '../upload/raw/schema_package.archive.yaml#/definitions/section_definitions/Country'

    # Save the country as a .archive.json
    with open(f'nomad_data/{country}.archive.json', 'wt') as f:
        f.write(json.dumps(json_data, indent=2))

<div style="height: 4rem;">&nbsp;</div>

## Analysis

You might also read [How to use the API](https://nomad-lab.eu/prod/v1/staging/docs/howto/programmatic/api.html) or [How to access processed data](https://nomad-lab.eu/prod/v1/staging/docs/howto/programmatic/archive_query.html)
from the NOMAD documentation.

<div class="alert alert-block alert-warning">
<b>Attention:</b> Before the next cells will work, you have to go back to NOMAD. On the upload page, press the reprocess button on the top-right. 
</div>

Analysis means you need to access the processed data from NOMAD. There are two principal ways: you can use a generic HTTP library like `requests` to use our RESTful API directly, or you can use our client library from the `nomad-lab` Python package.

Below, we are using `requests` to perform a query and retrieve the id for our "uploaded" `.yaml` schema. You can learn more about our API endpoints on the [API dashboard](https://nomad-lab.eu/prod/v1/api/v1/extensions/docs). With `requests` you get the raw API responses as JSON.

In [None]:
import requests
from nomad.config import client
from nomad.client import Auth

response = requests.post(f'{client.url}/v1/entries/query', auth=Auth(), json={
  'owner': 'user',
  'query': {
    'mainfile': 'schema_package.archive.yaml',
    'upload_name': 'NOMAD as a Data Management Framework Tutorial'
  }
})
schema_entry_id = response.json()['data'][0]['entry_id']
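
Indexing `['data'][0]` fails with an `IndexError` when the query returns no hits, for example when the upload has a different name or has not been processed yet. A small guard gives a clearer message. This is a sketch that assumes only the response shape used above (a `data` list of objects with an `entry_id`); `first_entry_id` is a hypothetical helper and the IDs below are made up.

```python
def first_entry_id(payload):
    """Return the entry_id of the first hit, or raise a descriptive error."""
    entries = payload.get('data', [])
    if not entries:
        raise LookupError('The query returned no entries; was the upload processed?')
    return entries[0]['entry_id']


# Stand-in payloads shaped like the response used above (IDs are made up).
found = {'data': [{'entry_id': 'example-id'}]}
empty = {'data': []}

print(first_entry_id(found))
try:
    first_entry_id(empty)
except LookupError as error:
    print(error)
```

In the notebook, this would be called as `first_entry_id(response.json())`.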

Below, we use the `ArchiveQuery` utility. It allows you to `query` for entries and access the `required` parts of the processed data at the same time. With `ArchiveQuery` you retrieve Python objects that instantiate the respective schema.

In [None]:
from nomad.client import ArchiveQuery

archive_query = ArchiveQuery(
    query={
       f'data.population#entry_id:{schema_entry_id}.Country#int:gt': 50e6,
       'upload_name': 'NOMAD as a Data Management Framework Tutorial'
    },
    required={
        'data': '*'
    }
)

countries = [entry.data for entry in archive_query.download()]
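
The query key in the cell above packs several pieces into one string. Reading it off the example, these appear to be the quantity path, a reference to the schema entry and its section, the value type, and the comparison operator. The hypothetical helper below only reassembles that exact pattern so the parts are easier to see; it is not part of the NOMAD client, and the entry ID is a placeholder.

```python
def custom_quantity_key(path, schema_entry_id, section, dtype, operator):
    """Rebuild the query key pattern used in the ArchiveQuery example."""
    return f'{path}#entry_id:{schema_entry_id}.{section}#{dtype}:{operator}'


key = custom_quantity_key(
    path='data.population',
    schema_entry_id='abc123',  # placeholder, normally taken from the API response
    section='Country',
    dtype='int',
    operator='gt',
)
print(key)
```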

With the data available, you can run your own analysis. For example, we can plot the data with `plotly`.

In [None]:
import plotly.express as px
import pandas as pd

px.line(
    pd.concat([
        pd.DataFrame(dict(
            year=country.gdp.year,
            GDP=country.gdp.value,
            name=country.name
        ))
        for country in countries
    ]),
    x='year', y=['GDP'], color='name'
).show()


In [None]:
px.line(
    pd.concat([
        pd.DataFrame(dict(
            year=country.birth_rate.year,
            birth_rate=country.birth_rate.value,
            name=country.name
        ))
        for country in countries
    ]),
    x='year', y='birth_rate', color='name'
).show()

<div style="height: 4rem;">&nbsp;</div>

## Adding Visualization to your NOMAD entries

You might also read the reference on [Plot Annotations](https://nomad-lab.eu/prod/v1/staging/docs/reference/annotations.html#plot)
from the NOMAD documentation. 

We can also put visualizations into the schema, allowing the NOMAD UI to show them. We can either add a schema *annotation* that informs the UI how to do the visualization, or we can add a Plotly figure to our data and let the UI simply show it.

### Plot annotation

Schema annotations can be added to both Python schemas and `.yaml` schemas. They do not require any Python code to run, so they also work for uploaded schemas without a plugin. Plot annotations require that your section inherits from `PlotSection`. This is how the annotation looks in a `.yaml` schema:

```yaml

definitions:
  name: Countries of the World
  section_definitions:
  - base_sections:
    - nomad.datamodel.metainfo.plot.PlotSection
    - nomad.datamodel.data.EntryData
    m_annotations:
      plotly_graph_object:
      - data:
          x: '#birth_rate/year'
          y: '#birth_rate/value'
        layout:
          yaxis:
            title: birth rate (per 1,000 people)
    name: Country
    quantities:
       ...
```
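
The `x` and `y` values in the annotation are references into the entry's data. As a rough illustration of how such a path can be read, the sketch below follows a `#a/b` style reference through nested dictionaries. It is not NOMAD's actual resolver; the function `resolve` and the numbers are made up for the example.

```python
def resolve(reference, data):
    """Follow a '#a/b' style reference through nested dictionaries."""
    if not reference.startswith('#'):
        raise ValueError(f'expected a reference starting with #, got {reference!r}')
    value = data
    for part in reference[1:].split('/'):
        value = value[part]
    return value


# Placeholder numbers, only to show the lookup.
country = {
    'name': 'Example',
    'birth_rate': {'year': [2000, 2001, 2002], 'value': [1.0, 2.0, 3.0]},
}

trace = {
    'x': resolve('#birth_rate/year', country),
    'y': resolve('#birth_rate/value', country),
}
print(trace)
```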

You can also add annotations in Python schemas:

In [None]:
import numpy as np
from nomad.metainfo import Section, Quantity, SubSection, MSection
from nomad.datamodel.data import Schema
from nomad.datamodel.metainfo.plot import PlotSection


class Timeseries(MSection):
    year = Quantity(type=np.int32, shape=['*'])
    value = Quantity(type=np.float64, shape=['*'])


class Country(PlotSection, Schema):
    m_def = Section(a_plotly_graph_object={
        'data': {
            'x': '#birth_rate/year',
            'y': '#birth_rate/value'
        },
        'layout': {
            'yaxis': {
                'title': 'birth rate (per 1,000 people)'
            }
        }
    })

    name = Quantity(type=str)
    population = Quantity(type=np.int32)
    area = Quantity(type=np.float64, unit='km^2')
    population_density = Quantity(type=np.float64, unit='1/km^2')

    gdp = SubSection(
        section=Timeseries,
        description='GDP per capita (constant 2005 US$)'
    )
    birth_rate = SubSection(
        section=Timeseries,
        description='per 1,000 people per year'
    )

    def normalize(self, archive, logger):
        self.population_density = self.population / self.area

The annotation is based on Plotly [graph objects](https://plotly.com/python/graph-objects/). The `layout` is passed directly to Plotly, while `data` lets you reference quantities in the entry's data. Let's save this new version:

In [None]:
from nomad.metainfo import SchemaPackage
from nomad.datamodel import EntryArchive
import yaml

def create_schema_package():
    return EntryArchive(
        definitions=SchemaPackage(
            name='Countries of the World',
            sections=[Country.m_def, Timeseries.m_def]
        )
    )

def save_schema_package_to_yaml():
    with open('schema_package.archive.yaml', 'wt') as f:
        f.write(yaml.dump(create_schema_package().m_to_dict(with_out_meta=True), indent=2))

save_schema_package_to_yaml()

You can go back to NOMAD and reprocess the upload. Look at one of the Country entries to see the plot.

### Creating custom plots programmatically

You can also create Plotly plots during processing as part of a `normalize` function. This gives you the full functionality of Plotly, and you simply store the result via Plotly's `to_plotly_json` method. We are using the base class `PlotSection`, which provides a `figures` property to store figures.

In [None]:
import pandas as pd
import plotly.express as px
from nomad.datamodel.metainfo.plot import PlotSection, PlotlyFigure


class Country(PlotSection, Schema):
    name = Quantity(type=str)
    population = Quantity(type=np.int32)
    area = Quantity(type=np.float64, unit='km^2')
    population_density = Quantity(type=np.float64, unit='1/km^2')

    gdp = SubSection(
        section=Timeseries,
        description='GDP per capita (constant 2005 US$)'
    )
    birth_rate = SubSection(
        section=Timeseries,
        description='per 1,000 people per year'
    )

    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        self.population_density = self.population / self.area

        self.figures.append(PlotlyFigure(
            figure=px.line(
                pd.DataFrame(dict(year=self.gdp.year, GDP=self.gdp.value)),
                x='year', y=['GDP']
            ).to_plotly_json()
        ))

save_schema_package_to_yaml()

You can go back to NOMAD and reprocess the upload. Look at one of the Country entries to see the plot.
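
A Plotly figure is, in essence, a dictionary with a `data` list of traces and a `layout` dictionary, which is why it can be stored alongside the other entry data. The sketch below builds such a dictionary by hand (placeholder values, no Plotly import needed) and round-trips it through `json`. The real output of `to_plotly_json` typically contains more keys than shown here.

```python
import json

# Hand-written stand-in for a Plotly figure dictionary; values are placeholders.
figure = {
    'data': [
        {
            'type': 'scatter',
            'mode': 'lines',
            'name': 'GDP',
            'x': [2000, 2001, 2002],
            'y': [1.0, 2.0, 3.0],
        }
    ],
    'layout': {
        'xaxis': {'title': {'text': 'year'}},
        'yaxis': {'title': {'text': 'GDP'}},
    },
}

# The dictionary survives serialization unchanged.
serialized = json.dumps(figure)
restored = json.loads(serialized)
print(sorted(restored))
```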

<div style="height: 4rem;">&nbsp;</div>

You have reached the end of this notebook. Here are some useful links:

- [nomad-lab.eu](https://nomad-lab.eu)
- [NOMAD Documentation](https://nomad-lab.eu/docs)
- [Our user forums](https://matsci.org/c/nomad/32)