# Parser Interface
This notebook illustrates how to create MDF-ready metadata with MDF-MatIO.

In [1]:
from mdf_matio.adapters import noop_parsers
from mdf_matio import get_mdf_parsers, generate_search_index
from materials_io.utils import interface as matio
from tarfile import TarFile
import pandas as pd
import os

## Get the Available Parsers
The MDF only uses a limited subset of the data available via each parser.
Consequently, the MDF interface to MaterialsIO only uses parsers for which we have defined this desired subset.

In [2]:
all_parsers = matio.get_available_parsers()
print(f'Found {len(all_parsers)} parsers:', set(all_parsers.keys()))

  warn('The libmagic library is not installed. '


Found 9 parsers: {'json', 'ase', 'em', 'crystal', 'noop', 'csv', 'dft', 'generic', 'image'}


One part of the MDF-MatIO library defines which parsers already produce data in an MDF-compatible format and, for the others, provides a method for transforming their outputs into a format compatible with the MDF's search index.

In [3]:
print(f'Found {len(noop_parsers)} parsers that require zero alteration:', noop_parsers)

Found 2 parsers that require zero alteration: ['image', 'em']


In [4]:
mdf_parsers = get_mdf_parsers()
print(f'Found {len(mdf_parsers)} compatible parsers:', mdf_parsers)

Found 6 compatible parsers: {'json', 'em', 'csv', 'dft', 'generic', 'image'}


Some of these parsers require an "adapter" to transform the data into the MDF format.

## Demonstrate Adapters
A good example of a parser that generates data in a non-MDF format is the "generic file parser."

In [5]:
test_file = os.path.join('example-files', 'dog2.jpeg')

In [6]:
generic_parser = matio.get_parser('generic')

The generic parser produces the hashes for the data file and, if the libmagic library is installed, autodetects the file format.

In [7]:
file_info = generic_parser.parse([test_file])
file_info

{'length': 269360,
 'filename': 'dog2.jpeg',
 'path': 'example-files\\dog2.jpeg',
 'sha512': '1f47ed450ad23e92caf1a0e5307e2af9b13edcd7735ac9685c9f21c9faec62cb95892e890a73480b06189ed5b842d8b265c5e47cc6cf279d281270211cff8f90'}
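As a rough illustration of what the fields above represent, here is a minimal sketch, using only the Python standard library, of how a file's length and SHA-512 digest could be computed. The helper name `describe_file` is hypothetical and is not part of MaterialsIO; the real parser may differ in details such as how `path` is reported.

```python
import hashlib
import os
import tempfile


def describe_file(path, chunk_size=65536):
    """Hypothetical helper: return length, name, path and SHA-512 of a file."""
    sha = hashlib.sha512()
    with open(path, 'rb') as fp:
        # Read in chunks so that large files are never fully held in memory
        for chunk in iter(lambda: fp.read(chunk_size), b''):
            sha.update(chunk)
    return {
        'length': os.path.getsize(path),
        'filename': os.path.basename(path),
        'path': path,
        'sha512': sha.hexdigest(),
    }


# Demonstrate on a small temporary file rather than the example image
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.txt')
    with open(demo_path, 'wb') as fp:
        fp.write(b'hello')
    demo_info = describe_file(demo_path)

print(demo_info['length'], demo_info['filename'])
```

A SHA-512 digest is always 128 hexadecimal characters long, which matches the length of the `sha512` values shown in this notebook.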

The MDF search index stores this information under the "files" block.
Our adapter to the "generic" parser performs this operation.

In [8]:
generic_adapter = matio.get_adapter('generic')

In [9]:
generic_adapter.transform(file_info)

{'files': [{'length': 269360,
   'filename': 'dog2.jpeg',
   'path': 'example-files\\dog2.jpeg',
   'sha512': '1f47ed450ad23e92caf1a0e5307e2af9b13edcd7735ac9685c9f21c9faec62cb95892e890a73480b06189ed5b842d8b265c5e47cc6cf279d281270211cff8f90',
   'data_type': 'Unknown'}]}

The advantage of this adapter is that the MDF need not implement the hashing or file-type detection framework.
The only work needed to use this MaterialsIO parser is some data reshaping, which is a much easier task.
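To make the "data reshaping" concrete, the following is a minimal sketch of a function that wraps one file's metadata in a "files" list, mimicking the shape of the adapter output above. The function `to_files_block` is hypothetical and not the actual adapter code. The default `data_type` of `'Unknown'` simply mirrors the output shown above; how the real adapter determines it is not covered here.

```python
def to_files_block(file_info, data_type='Unknown'):
    """Hypothetical reshaping step: wrap one file's metadata in a "files" list."""
    entry = dict(file_info)  # copy so the parser output is not modified
    entry.setdefault('data_type', data_type)
    return {'files': [entry]}


example_info = {'length': 269360, 'filename': 'dog2.jpeg'}
reshaped = to_files_block(example_info)
print(reshaped)
```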

## Parsing MDF-Compliant Data
The `generate_search_index` function uses these capabilities to automatically generate compliant data from a directory of files. It determines which parsers are available, runs them on all data in a directory, applies the adapters, and then merges the metadata of file records that describe the same record (e.g., a single experiment or calculation).

First, we unpack the VASP data, which is large enough that we do not want to commit the uncompressed files to GitHub.

In [10]:
with TarFile.open(os.path.join('example-files', 'calc', 'AlNi_static_LDA.tar.gz')) as t:
    t.extractall(os.path.join('example-files', 'calc'))

Deploying the search tool

In [11]:
record_gen = generate_search_index(os.path.join('example-files'), False)
record_gen

<generator object generate_search_index at 0x000001FE4B45A9A8>

MaterialsIO uses generators to avoid needing to hold the entire dataset in memory at once.
Each metadata record is generated incrementally, on demand.
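The on-demand behavior of generators can be seen with a toy example. This sketch does not use MaterialsIO at all; `lazy_records` and `work_log` are illustrative names that only show that no work happens until a record is requested.

```python
def lazy_records(paths, log):
    """Toy generator: build one record at a time, only when requested."""
    for path in paths:
        log.append(path)  # note that work was actually performed for this path
        yield {'files': [{'path': path}]}


work_log = []
gen = lazy_records(['a.csv', 'b.csv', 'c.csv'], work_log)
assert work_log == []  # nothing has been processed yet

first = next(gen)  # processes only the first path
print(first, work_log)

remaining = list(gen)  # consuming the generator processes the rest
print(len(remaining), work_log)
```

Calling `list` on a generator, as in the next cell, forces every record to be produced and held in memory at once, which is convenient for a small example dataset.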

In [12]:
records = list(record_gen)
print(f'Generated {len(records)} records')



Generated 5 records


## Investigate Results of Parsing
There are 5 different records parsed from our example files.

For simplicity, we just print the paths of files associated with each record

In [13]:
def print_simple_paths(r):
    return [f['path'] for f in r['files']]

In [14]:
for i, r in enumerate(records):
    print(f'{i+1}:', print_simple_paths(r))

1: ['group-by-dir\\other-dir\\dog3.jpeg']
2: ['group-by-dir\\csv\\dog1.jpeg', 'group-by-dir\\csv\\test.csv']
3: ['group-by-dir\\dog1.jpeg', 'group-by-dir\\dog2.jpeg', 'group-by-dir\\mdf.json']
4: ['calc\\AlNi_static_LDA\\XDATCAR', 'calc\\AlNi_static_LDA\\POSCAR', 'calc\\AlNi_static_LDA\\OUTCAR', 'calc\\AlNi_static_LDA\\OSZICAR', 'calc\\AlNi_static_LDA\\KPOINTS', 'calc\\AlNi_static_LDA\\INCAR', 'calc\\AlNi_static_LDA\\DOSCAR', 'calc\\AlNi_static_LDA\\CONTCAR']
5: ['dog2.jpeg']


One of our records (the third in the list above) contains several JPEG files that are grouped together.
Normally, JPEGs are not grouped together.
In this case, the `mdf.json` file in that directory directs the parser to group files by directory.

In [15]:
with open(os.path.join('example-files', 'group-by-dir', 'mdf.json')) as fp:
    print(fp.read())

{
    "parse_by_directory": true
}


The configuration file applies to all subdirectories of `example-files/group-by-dir`
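One way a configuration file could apply to subdirectories is to search upward from a directory until a configuration file is found. The sketch below illustrates that idea under the assumption that the nearest `mdf.json` wins. The function `find_config` is hypothetical, and the actual lookup performed by the library may work differently.

```python
import json
import os
import tempfile


def find_config(directory, root, name='mdf.json'):
    """Hypothetical lookup: return the config from `directory` or its nearest
    ancestor, stopping at `root`. Returns {} if no config file is found."""
    current = os.path.abspath(directory)
    root = os.path.abspath(root)
    while True:
        candidate = os.path.join(current, name)
        if os.path.isfile(candidate):
            with open(candidate) as fp:
                return json.load(fp)
        if current == root:
            return {}
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return {}
        current = parent


# Build a small directory tree that mirrors the layout used in this notebook
with tempfile.TemporaryDirectory() as root:
    sub = os.path.join(root, 'group-by-dir', 'other-dir')
    os.makedirs(sub)
    with open(os.path.join(root, 'group-by-dir', 'mdf.json'), 'w') as fp:
        json.dump({'parse_by_directory': True}, fp)
    sub_config = find_config(sub, root)    # inherits from group-by-dir
    root_config = find_config(root, root)  # no config at the top level

print(sub_config, root_config)
```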

Another record that contains $>1$ file includes all of the files from a VASP calculation.

In [16]:
dft_records = [x for x in records if 'dft' in x]

In [17]:
dft_records

[{'files': [{'length': 292,
    'filename': 'XDATCAR',
    'path': 'calc\\AlNi_static_LDA\\XDATCAR',
    'sha512': '5cc741db30247539aa75baa0510badd74e47d3fb966b0350d5d6914844222c9f4dc4b0d6bf9ae55bb04c0ed8e6b971ea522907e45b9c5bc942cd37c12be33056',
    'data_type': 'Unknown'},
   {'length': 189,
    'filename': 'POSCAR',
    'path': 'calc\\AlNi_static_LDA\\POSCAR',
    'sha512': '180f5c5e1d273b3c069f40d67a3f57add75e617df6a5d11c7b3ccd78c3db3ce4378bbb7106e27c4f1b744ba9e6da3538aa1362f6a1b4102a47f317d5540e2aaf',
    'data_type': 'Unknown'},
   {'length': 266416,
    'filename': 'OUTCAR',
    'path': 'calc\\AlNi_static_LDA\\OUTCAR',
    'sha512': 'cfcf0afc831204cb31354d7c03a8bc26cf4ac1c445afcc70dc336a19ed91338ab3d5f0556265dd50eb2bd18c7d4d69b1a882ee539de165ada6a3cfcae070a828',
    'data_type': 'Unknown'},
   {'length': 663,
    'filename': 'OSZICAR',
    'path': 'calc\\AlNi_static_LDA\\OSZICAR',
    'sha512': 'daedbf1bc47695fb5fc4a65f0a8ae8b60a74d84fde9c569e3d957a5ae87ebf21e5bdd482ae69e9760e7b

Note how this record contains >1 file and additional metadata for the DFT calculations

## Adding Mappings
The MDF Connect pipeline allows users to specify which fields in certain types of files are mapped to known fields in the MDF.
These instructions are included in the `index` field of a submission and our `generate_search_index` function takes this field without modification to define the mappings.

In [18]:
indexing = {
    'csv': {'mapping': {'material.composition': 'composition'}}
}

In [19]:
records = list(generate_search_index(os.path.join('example-files'), False, index_options=indexing))
print(f'Created {len(records)} records')



Created 7 records


Note that we created more records this time

In [20]:
for i, r in enumerate(records):
    print(f'{i+1}:', print_simple_paths(r))

1: ['group-by-dir\\other-dir\\dog3.jpeg']
2: ['group-by-dir\\csv\\dog1.jpeg', 'group-by-dir\\csv\\test.csv']
3: ['group-by-dir\\dog1.jpeg', 'group-by-dir\\dog2.jpeg', 'group-by-dir\\mdf.json']
4: ['calc\\AlNi_static_LDA\\XDATCAR', 'calc\\AlNi_static_LDA\\POSCAR', 'calc\\AlNi_static_LDA\\OUTCAR', 'calc\\AlNi_static_LDA\\OSZICAR', 'calc\\AlNi_static_LDA\\KPOINTS', 'calc\\AlNi_static_LDA\\INCAR', 'calc\\AlNi_static_LDA\\DOSCAR', 'calc\\AlNi_static_LDA\\CONTCAR']
5: ['test.csv']
6: ['test.csv']
7: ['dog2.jpeg']


You may notice that the `test.csv` file appears twice.

In [21]:
with open(os.path.join('example-files', 'test.csv')) as fp:
    print(fp.read())

composition,data
NaCl,4
LiFePO4,-1



This is because `test.csv` contains 2 records

In [22]:
[r for r in records if r['files'][0]['path'] == 'test.csv']

[{'material': {'composition': 'NaCl'},
  'files': [{'length': 38,
    'filename': 'test.csv',
    'path': 'test.csv',
    'sha512': 'c436c80612a7ac63545f10099aa0453238afe6d708c8606594d6bbceed46a3d6c956a62340851ff93452d5060791925530d145a8a4526ef893d51a010bcf141a',
    'data_type': 'Unknown'}]},
 {'material': {'composition': 'LiFePO4'},
  'files': [{'length': 38,
    'filename': 'test.csv',
    'path': 'test.csv',
    'sha512': 'c436c80612a7ac63545f10099aa0453238afe6d708c8606594d6bbceed46a3d6c956a62340851ff93452d5060791925530d145a8a4526ef893d51a010bcf141a',
    'data_type': 'Unknown'}]}]

As desired, the mapping we defined maps the data in the composition column into `material.composition`, and
there is one record per entry in the CSV file.
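The effect of such a mapping can be approximated with the standard library alone. The sketch below assumes, as in the `indexing` dictionary above, that each mapping key is a dotted destination field and each value is a source column name. The helpers `set_nested` and `rows_to_records` are hypothetical and are not functions from `mdf_matio`.

```python
import csv
import io


def set_nested(record, dotted_key, value):
    """Store `value` in `record` under a dotted key such as 'material.composition'."""
    *parents, leaf = dotted_key.split('.')
    target = record
    for key in parents:
        target = target.setdefault(key, {})
    target[leaf] = value


def rows_to_records(csv_text, mapping):
    """Hypothetical helper: create one record per CSV row using `mapping`,
    which maps destination fields to source column names."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for destination, column in mapping.items():
            if column in row:
                set_nested(record, destination, row[column])
        records.append(record)
    return records


csv_text = 'composition,data\nNaCl,4\nLiFePO4,-1\n'
csv_records = rows_to_records(csv_text, {'material.composition': 'composition'})
print(csv_records)
```

The unmapped `data` column is simply ignored in this sketch, consistent with the records shown above containing only the mapped field and the file information.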

The `mdf_matio` adapter will also merge the metadata records produced for CSV files with those from other files.
An example is merging files in a directory that contains a CSV file and an image

In [23]:
merged_csv = [r for r in records if r['files'][0]['path'] == os.path.join('group-by-dir', 'csv', 'dog1.jpeg')][0]

In [24]:
print_simple_paths(merged_csv)

['group-by-dir\\csv\\dog1.jpeg', 'group-by-dir\\csv\\test.csv']

Here, the metadata from all other files are merged into the metadata of each record from the CSV file.
This CSV file happens to have only one record.

In [25]:
merged_csv

{'material': {'composition': 'NaCl'},
 'files': [{'length': 269360,
   'filename': 'dog1.jpeg',
   'path': 'group-by-dir\\csv\\dog1.jpeg',
   'sha512': '1f47ed450ad23e92caf1a0e5307e2af9b13edcd7735ac9685c9f21c9faec62cb95892e890a73480b06189ed5b842d8b265c5e47cc6cf279d281270211cff8f90',
   'data_type': 'Unknown'},
  {'length': 26,
   'filename': 'test.csv',
   'path': 'group-by-dir\\csv\\test.csv',
   'sha512': 'a6080a499a7fc72f103cc72352b30d6abd87d6eca4bcda7c792fa5a0d90c2834b68ab8ed425f736193a57ee21489581205eb72f17143363bb8ae3a68893fe959',
   'data_type': 'Unknown'}],
 'image': {'width': 1910,
  'height': 1000,
  'format': 'JPEG',
  'megapixels': 1.91}}
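A simplified picture of this kind of merge is sketched below: the "files" lists are concatenated and the remaining top-level blocks are combined into one dictionary. The function `merge_records` is hypothetical, and the conflict rule it uses (later records overwrite earlier ones) is an assumption rather than documented behavior of `mdf_matio`.

```python
def merge_records(*records):
    """Hypothetical merge: concatenate "files" lists and combine other keys."""
    merged = {'files': []}
    for record in records:
        for key, value in record.items():
            if key == 'files':
                merged['files'].extend(value)
            else:
                merged[key] = value  # later records overwrite on key clashes
    return merged


image_record = {'files': [{'path': 'dog1.jpeg'}], 'image': {'format': 'JPEG'}}
csv_record = {'files': [{'path': 'test.csv'}], 'material': {'composition': 'NaCl'}}
merged = merge_records(image_record, csv_record)
print(merged)
```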

## JSON Files
Like CSV files, JSON files require a mapping in order to be parsed

In [26]:
records = list(generate_search_index(os.path.join('example-files', 'json'), False, index_options=indexing))
print(f'Created {len(records)} records')



Created 0 records


As before, we define our mapping following the same schema as the MDF Connect request

In [27]:
indexing = {
    'json': {
        'mapping': {'material.composition': 'composition', 'value': 'oqmd.delta_e'},
        'na_values': 'N/A'
    }
}

In [28]:
records = list(generate_search_index(os.path.join('example-files', 'json'), False, index_options=indexing))
print(f'Created {len(records)} records')



Created 5 records


In [29]:
for i, r in enumerate(records):
    print(f'{i+1}:', print_simple_paths(r))

1: ['line-delimited.json']
2: ['line-delimited.json']
3: ['list.json']
4: ['list.json']
5: ['simple.json']


The first case of a JSON file is one where we have a single object

In [30]:
list_file = os.path.join('example-files', 'json', 'simple.json')
with open(list_file) as fp:
    print(fp.read())

{"composition": "CuZr"}



In [31]:
[r for r in records if 'simple' in r['files'][0]['path']]

[{'material': {'composition': 'CuZr'},
  'files': [{'length': 25,
    'filename': 'simple.json',
    'path': 'simple.json',
    'sha512': 'af52bc1609a6a09851646a2524b1044f576c676a396a6dd1b0add67c83355c738a016dd0d5d4d9094e55cb34cdd9115d640fd1219278e718069df31ed0f6b3de',
    'data_type': 'Unknown'}]}]

Here, the mapping creates a single new record

Another option is for a JSON file to contain a list of multiple objects

In [32]:
list_file = os.path.join('example-files', 'json', 'list.json')
with open(list_file) as fp:
    print(fp.read())

[{"composition": "CuZr", "oqmd": {"delta_e": "N/A"}},
  {"composition": "Fe", "oqmd": {"delta_e": 0}}]



In [33]:
[r for r in records if 'list' in r['files'][0]['path']]

[{'material': {'composition': 'CuZr'},
  'files': [{'length': 105,
    'filename': 'list.json',
    'path': 'list.json',
    'sha512': '76f47619866e91da239aaa10bee41835e686392fe7d940e458f154d9c33701a4621ddafd13744354f6e5ccd0199a7f96cb37ef2f3f53cee408589e84ffd092c5',
    'data_type': 'Unknown'}]},
 {'material': {'composition': 'Fe'},
  'value': 0,
  'files': [{'length': 105,
    'filename': 'list.json',
    'path': 'list.json',
    'sha512': '76f47619866e91da239aaa10bee41835e686392fe7d940e458f154d9c33701a4621ddafd13744354f6e5ccd0199a7f96cb37ef2f3f53cee408589e84ffd092c5',
    'data_type': 'Unknown'}]}]

In this case, the parser and adapter recognize that the file contains multiple records.
Note that the `na_values` option in the indexing removes the missing value from the first record.
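The combination of dotted source keys and `na_values` can be sketched as follows. This is an illustration only: `get_nested` and `map_object` are hypothetical helpers, and the real parser and adapter may treat edge cases differently.

```python
_MISSING = object()  # sentinel that cannot collide with real JSON values


def get_nested(obj, dotted_key):
    """Look up a dotted key such as 'oqmd.delta_e'; return _MISSING if absent."""
    current = obj
    for key in dotted_key.split('.'):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def map_object(obj, mapping, na_values=()):
    """Hypothetical helper: copy mapped fields, skipping missing or N/A values."""
    if isinstance(na_values, str):
        na_values = [na_values]  # allow a single string, as in the cell above
    record = {}
    for destination, source in mapping.items():
        value = get_nested(obj, source)
        if value is _MISSING or value in na_values:
            continue
        # Write the value under the (possibly dotted) destination key
        *parents, leaf = destination.split('.')
        target = record
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return record


mapping = {'material.composition': 'composition', 'value': 'oqmd.delta_e'}
objects = [
    {'composition': 'CuZr', 'oqmd': {'delta_e': 'N/A'}},
    {'composition': 'Fe', 'oqmd': {'delta_e': 0}},
]
mapped = [map_object(o, mapping, na_values='N/A') for o in objects]
print(mapped)
```

Note that a legitimate value of `0` is kept, since only values listed in `na_values` are dropped.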

Finally, the JSON parser also supports reading multiple records from line-delimited JSON files.

In [34]:
list_file = os.path.join('example-files', 'json', 'line-delimited.json')
with open(list_file) as fp:
    print(fp.read())

{"composition": "NaCl"}
{"composition":  "LiFePO4"}
{}



In [35]:
[r for r in records if 'line-' in r['files'][0]['path']]

[{'material': {'composition': 'NaCl'},
  'files': [{'length': 58,
    'filename': 'line-delimited.json',
    'path': 'line-delimited.json',
    'sha512': '7a450a893c60b1110dd6085f5b52661482249af3352a123b5660b5dffeb5550bf285755da65c4c589c70abdf6820d3a8061a2733f0c6a8af5f620228412856fd',
    'data_type': 'Unknown'}]},
 {'material': {'composition': 'LiFePO4'},
  'files': [{'length': 58,
    'filename': 'line-delimited.json',
    'path': 'line-delimited.json',
    'sha512': '7a450a893c60b1110dd6085f5b52661482249af3352a123b5660b5dffeb5550bf285755da65c4c589c70abdf6820d3a8061a2733f0c6a8af5f620228412856fd',
    'data_type': 'Unknown'}]}]

Note how it reads multiple records and skips the last line that does not match any mapping.
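Reading line-delimited JSON and skipping lines that yield no mapped fields can be sketched with the standard `json` module. The helper `parse_json_lines` is hypothetical and, to stay short, only handles top-level keys with a flat mapping, unlike the dotted mappings used above.

```python
import json


def parse_json_lines(text, mapping):
    """Hypothetical helper: one record per non-empty line; lines that yield
    no mapped fields are skipped. Only top-level source keys are handled."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        obj = json.loads(line)
        record = {dest: obj[src] for dest, src in mapping.items() if src in obj}
        if record:  # drop objects, such as {}, that match nothing in the mapping
            records.append(record)
    return records


text = '{"composition": "NaCl"}\n{"composition":  "LiFePO4"}\n{}\n'
line_records = parse_json_lines(text, {'composition': 'composition'})
print(line_records)
```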