# Hyperspy as an automatic metadata extractor

With this notebook, I'd like to assess whether the Hyperspy library can be used to automate metadata extraction from SEM and TEM research images and data. The advantage of this package is that it uses its own native schema, so the extracted metadata is already structured in a more or less standardized format. While this schema is of no direct use to us, it does allow us to avoid creating a different map for nearly every single instrument, as the input to the mapping service will always be in the same format.

### Load the required packages

In [2]:
import hyperspy.api as hs
import json
import os

Load a folder of test images to see how each reacts to the automatic metadata extraction.

In [3]:
def getImages(folder_path):
    tiff_images_list = []
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        if os.path.isfile(file_path) and filename.lower().endswith('.tif'):
            tiff_images_list.append(file_path)
    return tiff_images_list

testFolder = '/Users/elias/Desktop/MatWerk_Projects/testImages'

images = getImages(testFolder)
images

['/Users/elias/Desktop/MatWerk_Projects/testImages/Na3FePO43 -04-Zeiss IAM-ESS.tif',
 '/Users/elias/Desktop/MatWerk_Projects/testImages/Nozzle Chip RIE KOH10 Zeiss EVO IMT.tif',
 '/Users/elias/Desktop/MatWerk_Projects/testImages/NK_PA07_1-4_KA-W 136980-Zeiss-IAM ESS.tif',
 '/Users/elias/Desktop/MatWerk_Projects/testImages/Au-Gr_06.tif',
 '/Users/elias/Desktop/MatWerk_Projects/testImages/SEM_image_sample_Thermo_Fisher_Helios_G4_PFIB_CXe.tif',
 '/Users/elias/Desktop/MatWerk_Projects/testImages/SEM_image_sample_FEI_Helios_Nanolab600.tif',
 '/Users/elias/Desktop/MatWerk_Projects/testImages/SEM Image 2 - SliceImage - 001.tif']
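The cells below pick images by position (`images[0]`, `images[1]`, ...), which relies on the order returned by `os.listdir`, and that order is not guaranteed. A minimal sketch of a variant that returns a sorted list and also accepts the `.tiff` extension could look like the following. The name `get_tiff_images` and the `extensions` parameter are my own and not part of the notebook, and sorting would change the indices used in the rest of this notebook.

```python
from pathlib import Path


def get_tiff_images(folder_path, extensions=(".tif", ".tiff")):
    """Return the TIFF files directly inside folder_path, sorted by name."""
    folder = Path(folder_path)
    return sorted(
        str(path)
        for path in folder.iterdir()
        # Skip sub-folders and compare the extension case-insensitively.
        if path.is_file() and path.suffix.lower() in extensions
    )
```

With a sorted list the position of each file stays the same between runs and machines, as long as the folder content does not change.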

# Na3FePO43 -04-Zeiss IAM-ESS.tif 

Zeiss Merlin Instrument

In [None]:
f = hs.load(images[0])

In [None]:
f.original_metadata.as_dictionary()

We can see that simply loading the TIFF image and converting its original metadata to a dictionary with built-in Hyperspy functions formats it nicely. Every element of the schema can also be found within it, meaning it successfully passes this test. The conversion table for the "variable names" of each metadata variable can be [found here](https://docs.google.com/spreadsheets/d/1f_9qKa2BbA5_q47ild_fZeQFKPUJKcxcbkYkhje0EF0/edit?usp=sharing).
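To make the idea of a conversion table concrete, here is a rough sketch of how such a table could be applied to the dictionary returned by `as_dictionary()`. Everything in it is illustrative: the helper names `get_by_path` and `apply_map`, the example keys and the schema field names are assumptions, not values taken from the real files or from the linked spreadsheet.

```python
def get_by_path(data, dotted_path, default=None):
    """Follow a dotted key path such as 'Vendor.beam_voltage' through nested dicts."""
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def apply_map(original_metadata, conversion_table):
    """Build {schema_field: value}; fields whose source path is absent map to None."""
    return {
        schema_field: get_by_path(original_metadata, source_path)
        for schema_field, source_path in conversion_table.items()
    }


# Hypothetical example data: neither the keys nor the schema field names
# come from the real files or from the conversion spreadsheet.
example_metadata = {"Vendor": {"beam_voltage": 5.0, "working_distance": 4.2}}
example_table = {
    "accelerationVoltage": "Vendor.beam_voltage",
    "workingDistance": "Vendor.working_distance",
    "detectorName": "Vendor.detector",
}

mapped = apply_map(example_metadata, example_table)
```

Fields that cannot be found come back as `None`, which makes gaps easy to spot. Note that this simple lookup would not work for keys that themselves contain a dot.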

# Nozzle Chip RIE KOH10 Zeiss EVO IMT.tif

Zeiss EVO instrument

In [None]:
f = hs.load(images[1])

In [None]:
f.original_metadata.as_dictionary()

The metadata is extracted from this next test image in exactly the same way with the exact same variable names.

# NK_PA07_1-4_KA-W 136980-Zeiss-IAM ESS.tif

Zeiss Supra 55 FE-SEM Instrument

In [None]:
f = hs.load(images[2])
f.original_metadata.as_dictionary()

This image also passes the test, which is expected, as it comes from a Zeiss instrument like the first image.

# Au-Gr_06.tif

Zeiss CNR-IOM instrument

In [None]:
f = hs.load(images[3])
f.original_metadata.as_dictionary()

This also works as required, with the same variable names.

# SEM_image_sample_Thermo_Fisher_Helios_G4_PFIB_CXe.tif

Helios PFIB instrument

In [23]:
f = hs.load(images[4])
f.original_metadata.as_dictionary()

{'ImageWidth': 1536,
 'ImageLength': 1094,
 'BitsPerSample': (8, 8, 8),
 'Compression': <COMPRESSION.NONE: 1>,
 'PhotometricInterpretation': <PHOTOMETRIC.RGB: 2>,
 'StripOffsets': (8, 4616, 9224, [...], 387080,
[output truncated]

Here we start running into a little trouble. The output is neatly formatted in a standardized way but, sadly, differs from the previous entries. One could make a map for these too, but some components also appear to be missing (units for some of the values, for one). It could be that some entries have names I don't recognize and are therefore not actually missing, but I couldn't find some of the values.
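Rather than comparing two outputs by eye, the difference in layout could be checked programmatically. The sketch below flattens two nested dictionaries into dotted key paths and reports which paths exist in only one of them. The helper names and the two small example dictionaries are hypothetical stand-ins for real `as_dictionary()` results.

```python
def flatten_keys(data, prefix=""):
    """Return the set of dotted key paths for every leaf value in a nested dict."""
    paths = set()
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            paths |= flatten_keys(value, path)
        else:
            paths.add(path)
    return paths


def compare_layouts(reference, candidate):
    """Report which key paths appear in only one of the two dictionaries."""
    ref_keys, cand_keys = flatten_keys(reference), flatten_keys(candidate)
    return {
        "only_in_reference": sorted(ref_keys - cand_keys),
        "only_in_candidate": sorted(cand_keys - ref_keys),
    }


# Hypothetical stand-ins for two as_dictionary() results.
reference_like = {"ImageWidth": 1024, "Vendor": {"voltage": 5.0, "units": "kV"}}
candidate_like = {"ImageWidth": 1536, "Vendor": {"voltage": 5.0}}

report = compare_layouts(reference_like, candidate_like)
```

This only compares key names, not values, so an entry that exists under a different name would still show up as missing.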

# SEM_image_sample_FEI_Helios_Nanolab600.tif

Helios Nanolab instrument

In [25]:
f = hs.load(images[5])
f.original_metadata.as_dictionary()

{'NewSubfileType': <FILETYPE.UNDEFINED: 0>,
 'ImageWidth': 1024,
 'ImageLength': 943,
 'BitsPerSample': (8, 8, 8),
 'Compression': <COMPRESSION.NONE: 1>,
 'PhotometricInterpretation': <PHOTOMETRIC.RGB: 2>,
 'StripOffsets': (7750, 10822, 13894, [...], 253510,
[output truncated]

Same story here, except the metadata seems to be even more limited. In any case, it's not in the same format as the first few, meaning another map has to be made.
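Much of what is printed for these files consists of tags that describe the pixel layout of the TIFF (such as `StripOffsets`) rather than the instrument. To see what instrument metadata actually remains, one option is to drop those structural tags first. The sketch below does this for the top level of the dictionary. The tag names in the set are baseline TIFF tag names, while `drop_structural_tags` and the `VendorBlock` key in the example are hypothetical.

```python
# Tag names from the baseline TIFF specification; they describe the pixel
# layout rather than the instrument. Extend the set as needed.
STRUCTURAL_TIFF_TAGS = {
    "NewSubfileType", "ImageWidth", "ImageLength", "BitsPerSample",
    "Compression", "PhotometricInterpretation", "StripOffsets",
    "SamplesPerPixel", "RowsPerStrip", "StripByteCounts",
    "XResolution", "YResolution", "PlanarConfiguration", "ResolutionUnit",
}


def drop_structural_tags(original_metadata):
    """Return a copy of the top-level dictionary without structural TIFF tags."""
    return {
        key: value
        for key, value in original_metadata.items()
        if key not in STRUCTURAL_TIFF_TAGS
    }


# Hypothetical input; "VendorBlock" stands in for whatever instrument-specific
# entry a given file carries.
example = {
    "ImageWidth": 1024,
    "ImageLength": 943,
    "StripOffsets": (7750, 10822, 13894),
    "VendorBlock": {"some_setting": "value"},
}
remaining = drop_structural_tags(example)
```

If `remaining` turns out to be empty for a file, that file carries no instrument metadata that Hyperspy could expose, and a map would not help.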

# SEM Image 2 - SliceImage - 001.tif

This is an image from the PP13 SEM/FIB Tomography dataset.

In [3]:
f = hs.load(images[6])
f.original_metadata.as_dictionary()

{'ImageWidth': 855,
 'ImageLength': 770,
 'BitsPerSample': 8,
 'Compression': <COMPRESSION.NONE: 1>,
 'PhotometricInterpretation': <PHOTOMETRIC.MINISBLACK: 1>,
 'StripOffsets': (8, 863, 1718, [...], 78668,
[output truncated]

Same story here. There is some additional metadata about the instrument hidden in a variable in XML format (as a string...), but it is not very useful: it's missing units, among other things. This means one has to refer to some sort of external documentation in order to map this metadata accurately.
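Since the XML arrives as a plain string, it has to be parsed before it can be mapped. A minimal sketch using Python's standard `xml.etree.ElementTree` module is shown below. The helper `xml_to_dict` and the XML snippet are hypothetical, the real tag names in the test image are not reproduced here, and XML attributes are ignored for simplicity.

```python
import xml.etree.ElementTree as ET


def xml_to_dict(element):
    """Recursively convert an XML element into nested dicts of text values."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = xml_to_dict(child)
        if child.tag in result:
            # Repeated tags are collected into a list.
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Hypothetical XML; not the content of the real file.
xml_string = (
    "<Metadata><Beam><Voltage>5000</Voltage></Beam>"
    "<Stage><X>1.5</X><X>2.5</X></Stage></Metadata>"
)
root = ET.fromstring(xml_string)
parsed = {root.tag: xml_to_dict(root)}
```

All values stay strings after parsing, so units and numeric types would still have to come from external documentation.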

# TEM Images Metadata Extraction

As previously discussed, TEM image metadata needs to be extracted from accompanying files, as Hyperspy will only read the metadata stored in the TIFF image itself, and that does not include metadata about the actual project. This is presumably by design, simply because no project metadata is actually embedded in the TIFFs coming from TEM instruments. Luckily, the images are usually (always?) accompanied by such files.

In [15]:
temMetadataDir = "/Users/elias/Desktop/MatWerk_Projects/images_to_try/Data for TEM-Schema"

def getTemMetadata(folder_path):
    metadataFileList = []
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        if os.path.isfile(file_path) and filename.lower().endswith('.emd'):
            metadataFileList.append(file_path)
    return metadataFileList

temMetadata = getTemMetadata(temMetadataDir)
temMetadata

['/Users/elias/Desktop/MatWerk_Projects/images_to_try/Data for TEM-Schema/NEW-dimple-polish-pips 20230302 1127 Camera 3800 x Ceta.emd',
 '/Users/elias/Desktop/MatWerk_Projects/images_to_try/Data for TEM-Schema/NEW-dimple-polish-pips 20230302 1244 STEM 5300 x HAADF.emd',
 '/Users/elias/Desktop/MatWerk_Projects/images_to_try/Data for TEM-Schema/NEW-dimple-polish-pips 20230302 1143 STEM 7500 x HAADF.emd',
 '/Users/elias/Desktop/MatWerk_Projects/images_to_try/Data for TEM-Schema/NEW-dimple-polish-pips 20230302 1146 STEM 5300 x HAADF.emd',
 '/Users/elias/Desktop/MatWerk_Projects/images_to_try/Data for TEM-Schema/LaserCarbon_VS_Themis300-054 Camera 600 mm Ceta 20210421 1623.emd']

In this case, all the images are from the same instrument, so I only extracted the metadata from one to see how it behaves.

In [18]:
TEM = hs.load(temMetadata[0])
TEM.original_metadata.as_dictionary()

{'Core': {'MetadataDefinitionVersion': '7.9',
  'MetadataSchemaVersion': 'v1/2013/07',
  'guid': '00000000000000000000000000000000'},
 'Instrument': {'ControlSoftwareVersion': '2.15.3',
  'Manufacturer': 'FEI Company',
  'InstrumentId': '3900',
  'InstrumentClass': 'Titan',
  'InstrumentModel': 'Themis',
  'ComputerName': 'TITAN52339000'},
 'Acquisition': {'AcquisitionStartDatetime': {'DateTime': '1677752847'},
  'AcquisitionDatetime': {'DateTime': '1677752847'},
  'BeamType': '',
  'SourceType': 'XFEG'},
 'Optics': {'GunLensSetting': '3',
  'ExtractorVoltage': '4100',
  'AccelerationVoltage': '300000',
  'SpotIndex': '4',
  'C1LensIntensity': '-0.27627953886985779',
  'C2LensIntensity': '0.49731478095054626',
  'C3LensIntensity': '0.28566396236419678',
  'ObjectiveLensIntensity': '0.88349461555480957',
  'IntermediateLensIntensity': '-0.016146063804626465',
  'DiffractionLensIntensity': '0.32825303077697754',
  'Projector1LensIntensity': '0.3777472972869873',
  'Projector2LensIntensit

As no map file exists for a TEM schema (which is still in the works, awaiting feedback from INT colleagues), it is not yet possible to verify whether everything required by the schema is extracted. However, it can be noted immediately that the metadata read from these `.emd` files is much more complete than that from the TIFFs. If such a file format can be provided by all instruments, it's likely that Hyperspy reads it in the same way regardless of which instrument it came from. There are also additional metadata export formats beyond simple TIFFs which are meant to accompany the generated research images and can therefore be "linked" to the image in some way. These give us more options and a better way to standardize the metadata extraction process: we could require a metadata file which is automatically generated by most, if not all, instruments (if possible, of course). I am currently awaiting feedback from the SEM/TEM experts on whether these metadata files can be generated along with the TIFFs on all instruments.
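One thing that stands out in the `.emd` output is that every value is a string, including numbers and what look like Unix timestamps. A small sketch of how these could be converted before mapping is given below. The helper names are my own, reading `DateTime` as seconds since the Unix epoch is an inference from the value, and the unit of `AccelerationVoltage` (presumably volts) is an assumption that should be checked against the vendor documentation.

```python
from datetime import datetime, timezone


def parse_epoch(value):
    """Interpret a string of seconds since the Unix epoch as a UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


def to_number(value):
    """Convert a numeric string to float; return None for empty or non-numeric text."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Values copied from the output above.
acquisition = {"AcquisitionDatetime": {"DateTime": "1677752847"}, "BeamType": ""}
optics = {"AccelerationVoltage": "300000", "C1LensIntensity": "-0.27627953886985779"}

acquired_at = parse_epoch(acquisition["AcquisitionDatetime"]["DateTime"])
voltage = to_number(optics["AccelerationVoltage"])
beam_type = to_number(acquisition["BeamType"])

print(acquired_at.isoformat())  # 2023-03-02T10:27:27+00:00
```

The resulting date matches the `20230302` in the file name, and the time would match its `1127` if the instrument clock ran at UTC+1, which supports the epoch interpretation but does not prove it.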

# Conclusion

While Hyperspy is not the end-all solution that we were hoping it would be, it does provide a significant boost in efficiency to the metadata extraction. I would indeed recommend that it be implemented, but how exactly this could be done needs to be discussed with Reetu and Rossella. We also need to wait for information from INT about whether these file formats can be easily provided, and to discuss whether this is even an acceptable solution.

Hyperspy does pave the way to a "universal" extractor, but it's not the magic tool we were hoping that it would turn out to be. Its extraction algorithm is eons ahead of my manual extraction, but more research is needed to understand exactly how it is completing this extraction under the hood and why some tiffs are treated differently and not extracted under the same schema as others. It appears that it has different structures/schemas for various metadata formats which it receives, but I'm still looking into it. The next steps would be to reformat the codebase of the existing metadata mappings to be more in line with the polished and published tools (i.e. Nicolas' DICOM mapper), and then look into how Hyperspy fits into this.
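If Hyperspy does return a handful of distinct layouts rather than a single one, a first step towards fitting it into the mapping code could be to detect which layout a given dictionary has and then pick the matching map. The following is only a sketch of that idea. The function name and most marker keys are hypothetical; only `Instrument` is taken from the `.emd` output shown earlier, and the real markers for the TIFF layouts would have to be read off the actual `as_dictionary()` outputs.

```python
def detect_layout(original_metadata, markers):
    """Return the first layout name whose marker key is present, else None."""
    for layout_name, marker_key in markers.items():
        if marker_key in original_metadata:
            return layout_name
    return None


# Hypothetical marker keys, to be replaced by keys observed in real output.
LAYOUT_MARKERS = {
    "zeiss_tiff": "ZeissBlock",
    "helios_tiff": "HeliosBlock",
    "emd": "Instrument",
}

example = {"Core": {}, "Instrument": {"Manufacturer": "FEI Company"}}
layout = detect_layout(example, LAYOUT_MARKERS)
```

A dictionary for which `detect_layout` returns `None` would be a file whose layout has not been mapped yet.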