# Analyzing machine learning model training results #

## A NOTE BEFORE STARTING ##

Since the ``emicroml`` git repository tracks this notebook under its original
basename ``analyzing_ml_model_training_results.ipynb``, we recommend that you
copy the original notebook and rename it to any other basename that is not one
of the original basenames that appear in the ``<root>/examples`` directory
before executing any of the notebook cells below, where ``<root>`` is the root
of the ``emicroml`` repository. For example, you could rename it
``analyzing_ml_model_training_results_sandbox.ipynb``. This way you can explore
the notebook by executing and modifying cells without changing the original
notebook, which is being tracked by git.

## Import necessary modules ##

In [None]:
# For pattern matching.
import re

# For listing files and subdirectories in a given directory, and for renaming
# directories.
import os



# For general array handling.
import numpy as np
import torch

# For loading objects from HDF5 files.
import h5pywrappers

# For creating and plotting figures.
import matplotlib.pyplot as plt
import matplotlib.ticker



# For loading ML datasets and models for distortion estimation in CBED.
import emicroml.modelling.cbed.distortion.estimation

In [None]:
%matplotlib ipympl

## Introduction ##

In this notebook, we analyze the output that results from performing the
"actions" described in the following pages:

1. [Generating machine learning datasets for training and validation](https://mrfitzpa.github.io/emicroml/examples/modelling/cbed/distortion/estimation/generate_ml_datasets_for_training_and_validation.html)
2. [Combining then splitting machine learning datasets for training and validation](https://mrfitzpa.github.io/emicroml/examples/modelling/cbed/distortion/estimation/combine_ml_datasets_for_training_and_validation_then_split.html)
3. [Training machine learning models](https://mrfitzpa.github.io/emicroml/examples/modelling/cbed/distortion/estimation/train_ml_model_set.html)

while also demonstrating how one can use a selection of the functions and
classes in the module
[emicroml.modelling.cbed.distortion.estimation](https://mrfitzpa.github.io/emicroml/_autosummary/emicroml.modelling.cbed.distortion.estimation.html#module-emicroml.modelling.cbed.distortion.estimation).
In short, in this notebook we analyze machine learning (ML) model training
results for the ML task of estimating distortion in convergent beam electron
diffraction (CBED).

In order to execute the cells in this notebook as intended, a set of Python
libraries needs to be installed in the Python environment within which the cells
of the notebook are to be executed. See [this
page](https://mrfitzpa.github.io/emicroml/examples/prerequisites_for_execution_without_slurm.html)
for instructions on how to do so. Additionally, a subset of the output that
results from performing the aforementioned actions is required to execute the
cells in this notebook as intended. One can obtain this subset of output by
executing said actions; however, this requires significant computational
resources, including significant walltime. Alternatively, one can copy this
subset of output from a Federated Research Data Repository dataset by following
the instructions given on [this
page](https://mrfitzpa.github.io/emicroml/examples/modelling/cbed/distortion/estimation/copying_subset_of_output_from_frdr_dataset.html).

You can find the documentation for the ``emicroml`` library
[here](https://mrfitzpa.github.io/emicroml/_autosummary/emicroml.html).
It is recommended that you consult the documentation of this
library as you explore the notebook. Moreover, users should
execute the cells in the order that they appear, i.e. from top to
bottom, as some cells reference variables that are set in other
cells above them. **Users should make sure to navigate the
documentation for the version of ``emicroml`` that they are
currently using.**

## Loading and analyzing the ML training dataset ##

Upon successful completion of the action described in the page [Combining then
splitting machine learning datasets for training and
validation](https://mrfitzpa.github.io/emicroml/examples/modelling/cbed/distortion/estimation/combine_ml_datasets_for_training_and_validation_then_split.html),
ML training and validation datasets are stored in the HDF5 files at the file
paths ``../data/ml_datasets/ml_dataset_for_training.h5`` and
``../data/ml_datasets/ml_dataset_for_validation.h5`` respectively.

Let's look at some of the ML data instances stored in the ML training dataset,
namely the first five to start:

In [None]:
path_to_data_dir = "../data"
path_to_ml_dataset = (path_to_data_dir
                      + "/ml_datasets/ml_dataset_for_training.h5")

module_alias = emicroml.modelling.cbed.distortion.estimation
kwargs = {"path_to_ml_dataset": path_to_ml_dataset, 
          "entire_ml_dataset_is_to_be_cached": False, 
          "ml_data_values_are_to_be_checked": False}
ml_dataset = module_alias.MLDataset(**kwargs)



single_dim_slice = slice(0, 5)

kwargs = {"single_dim_slice": single_dim_slice, 
          "unnormalize_normalizable_elems": True}
ml_data_instances = ml_dataset.get_ml_data_instances(**kwargs)

Next, let's print the names of all the features of the ML data instances:

In [None]:
for key in ml_data_instances:
    print(key)

The features of the ML data instances are described in detail
[here](https://mrfitzpa.github.io/emicroml/_autosummary/emicroml.modelling.cbed.distortion.estimation.generate_and_save_ml_dataset.html).

The normalizable features/elements of ML data instances stored in the HDF5 files
are expected to be min-max normalized. To determine whether a feature is
normalizable, simply check either the normalization weights or biases of the ML
dataset:

In [None]:
ml_dataset.normalization_weights

If the name of a feature appears in the above dictionary, then that feature is
normalizable.

In the first code block of this section, we loaded a subset of ML data instances
and then unnormalized the normalizable features of said
subset. This was done in a single call to the method
[emicroml.modelling.cbed.distortion.estimation.MLDataset.get_ml_data_instances](https://mrfitzpa.github.io/emicroml/_autosummary/emicroml.modelling.cbed.distortion.estimation.MLDataset.html#emicroml.modelling.cbed.distortion.estimation.MLDataset.get_ml_data_instances).

Alternatively, we can do this in two steps:

In [None]:
kwargs = {"single_dim_slice": single_dim_slice, 
          "unnormalize_normalizable_elems": False}
ml_data_instances = ml_dataset.get_ml_data_instances(**kwargs)

# The following code block modifies ``ml_data_instances`` in place.
module_alias = emicroml.modelling.cbed.distortion.estimation
kwargs = {"ml_data_dict": ml_data_instances,
          "normalization_weights": ml_dataset.normalization_weights,
          "normalization_biases": ml_dataset.normalization_biases}
module_alias.unnormalize_normalizable_elems_in_ml_data_dict(**kwargs)

For completeness, we show the inverse operation of normalizing the unnormalized
normalizable features:

In [None]:
module_alias = emicroml.modelling.cbed.distortion.estimation
kwargs = {"ml_data_dict": ml_data_instances,
          "normalization_weights": ml_dataset.normalization_weights,
          "normalization_biases": ml_dataset.normalization_biases}
module_alias.normalize_normalizable_elems_in_ml_data_dict(**kwargs)
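
To make the relationship between the two operations concrete, here is a minimal sketch of min-max normalization for a single scalar feature, written in plain Python. It assumes the affine convention ``normalized = weight * value + bias``, with ``weight = 1 / (max - min)`` and ``bias = -min * weight``. This convention, the helper names, and the sample values are illustrative assumptions only; consult the ``emicroml`` documentation for the convention actually used for the normalization weights and biases.

```python
# Hypothetical helpers for illustration; these are not part of ``emicroml``.
def min_max_weight_and_bias(values):
    # Assumed convention: normalized = weight * value + bias.
    lo, hi = min(values), max(values)
    weight = 1.0 / (hi - lo)
    bias = -lo * weight
    return weight, bias


def normalize(values, weight, bias):
    # Map each value into the unit interval [0, 1].
    return [weight * value + bias for value in values]


def unnormalize(values, weight, bias):
    # Invert the affine map applied by ``normalize``.
    return [(value - bias) / weight for value in values]


# Hypothetical unnormalized feature values.
raw = [2.0, 4.0, 6.0, 10.0]

weight, bias = min_max_weight_and_bias(raw)
normalized = normalize(raw, weight, bias)
recovered = unnormalize(normalized, weight, bias)

print(normalized)
print(recovered)
```

Under this assumed convention, unnormalizing and then normalizing (or vice versa) returns the original values, which is why the two cells above leave ``ml_data_instances`` in its initial normalized state.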

Rather than load the subset of ML data instances as a dictionary of arrays, we
can load the subset as a sequence of `HyperSpy` signals:

In [None]:
cbed_pattern_image = ml_data_instances["cbed_pattern_images"][0]
sampling_grid_dims_in_pixels = cbed_pattern_image.shape

kwargs = \
    {"single_dim_slice": single_dim_slice, 
     "sampling_grid_dims_in_pixels": sampling_grid_dims_in_pixels}
ml_data_instances_as_signals = \
    ml_dataset.get_ml_data_instances_as_signals(**kwargs)

Let's plot one of the signals, which represents a single ML data instance:

In [None]:
cbed_pattern_idx = 0

ml_data_instance_as_signal = ml_data_instances_as_signals[cbed_pattern_idx]

kwargs = {"axes_off": True, 
          "scalebar": False, 
          "colorbar": False, 
          "gamma": 0.2,
          "cmap": "plasma", 
          "title": ""}
ml_data_instance_as_signal.plot(**kwargs)

And here's the signal's metadata:

In [None]:
ml_data_instance_as_signal.metadata

See
[here](https://mrfitzpa.github.io/fakecbed/_autosummary/fakecbed.discretized.CBEDPattern.html#fakecbed.discretized.CBEDPattern.signal)
for a detailed description of the signal data and metadata. See also
[here](https://mrfitzpa.github.io/emicroml/_autosummary/emicroml.modelling.cbed.distortion.estimation.MLDataset.html#emicroml.modelling.cbed.distortion.estimation.MLDataset.get_ml_data_instances_as_signals)
for additional context. Note that some, if not many, of these simulated CBED
patterns do not appear realistic. This is deliberate as we are primarily
concerned with generating ML datasets storing simulated CBED patterns that
capture the essential geometric features of real CBED patterns, namely perfectly
circular CBED disks in the absence of distortion. The placement of the CBED
disks is often unusual, but this is also deliberate.

An alternative way to load a subset of ML data instances as a sequence of
`HyperSpy` signals is as follows:

In [None]:
kwargs = {"single_dim_slice": single_dim_slice, 
          "unnormalize_normalizable_elems": True}
ml_data_instances = ml_dataset.get_ml_data_instances(**kwargs)

module_alias = emicroml.modelling.cbed.distortion.estimation
kwargs = {"ml_data_dict": ml_data_instances,
          "sampling_grid_dims_in_pixels": sampling_grid_dims_in_pixels}
ml_data_instances_as_signals = module_alias.ml_data_dict_to_signals(**kwargs)

## Loading and analyzing ML model training summary output data ##

Upon successful completion of the action described in the page [Training machine
learning
models](https://mrfitzpa.github.io/emicroml/examples/modelling/cbed/distortion/estimation/train_ml_model_set.html),
10 ML models are trained, and a dictionary representation of each ML model is
saved to a file. Moreover, the ML model training summary output data for each
trained ML model is saved to an HDF5 file. ML model training summary output files
are described in detail in the documentation for the method
[emicroml.modelling.cbed.distortion.estimation.MLModelTrainer.train_ml_model](https://mrfitzpa.github.io/emicroml/_autosummary/emicroml.modelling.cbed.distortion.estimation.MLModelTrainer.html#emicroml.modelling.cbed.distortion.estimation.MLModelTrainer.train_ml_model).
Let's examine the contents of one such HDF5 file.

Let's start by looking at the ML model training parameters used to train one of
the ML models. Said parameters have been serialized and stored in the HDF5 file:

In [None]:
path_to_ml_model_training_summary_output_data = \
    (path_to_data_dir
     + "/ml_models/ml_model_1"
     + "/ml_model_training_summary_output_data.h5")



kwargs = {"filename": path_to_ml_model_training_summary_output_data,
          "path_in_file": "ml_model_trainer_params"}
json_document_id = h5pywrappers.obj.ID(**kwargs)

serializable_rep_of_ml_model_trainer_params = \
    h5pywrappers.json.document.load(json_document_id)

serializable_rep_of_ml_model_trainer_params

The above output is a serialized representation of the ML model trainer
parameters used to train the ML model. ML model trainers are represented by the
[emicroml.modelling.cbed.distortion.estimation.MLModelTrainer](https://mrfitzpa.github.io/emicroml/_autosummary/emicroml.modelling.cbed.distortion.estimation.MLModelTrainer.html)
class. Instances of this class are used to train ML models. We can construct an
instance of this class using the above serialized representation of the ML model
trainer parameters, as long as the parameter values are valid, including those
that specify the paths to the ML training and validation datasets (to be)
used. It's possible that these paths are invalid in
``serializable_rep_of_ml_model_trainer_params`` as they may refer to temporary
locations of the ML training and validation datasets during training, e.g. if
the training was performed in a SLURM job. For demonstration purposes, let's
reset these paths to ensure a successful construction of an instance of the
above class:

In [None]:
phases = ("training", "validation")

for phase in phases:
    key_1 = "ml_dataset_manager"
    key_2 = "ml_{}_dataset".format(phase)
    key_3 = "path_to_ml_dataset"

    path_to_ml_dataset = (path_to_data_dir
                          + "/ml_datasets"
                          + "/ml_dataset_for_{}.h5".format(phase))

    serializable_rep_of_ml_model_trainer_params[key_1][key_2][key_3] = \
        path_to_ml_dataset
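
The path reset above relies on the serialized representation being a plain nested dictionary. The following sketch shows the same three-level key lookup on a toy dictionary. The toy dictionary, its temporary paths, and its ``mini_batch_size`` value are hypothetical stand-ins, not the actual contents of the HDF5 file; only the key names are taken from the cell above.

```python
import copy
import json

# Hypothetical stand-in for the serialized ML model trainer parameters.
toy_params = {
    "ml_dataset_manager": {
        "ml_training_dataset": {"path_to_ml_dataset": "/tmp/job/training.h5"},
        "ml_validation_dataset": {"path_to_ml_dataset": "/tmp/job/validation.h5"},
        "mini_batch_size": 64,
    }
}

# Work on a deep copy so that the original representation is left untouched.
updated_params = copy.deepcopy(toy_params)

for phase in ("training", "validation"):
    key_1 = "ml_dataset_manager"
    key_2 = "ml_{}_dataset".format(phase)
    key_3 = "path_to_ml_dataset"

    new_path = "../data/ml_datasets/ml_dataset_for_{}.h5".format(phase)
    updated_params[key_1][key_2][key_3] = new_path

# A serializable representation survives a JSON round trip unchanged.
round_tripped_params = json.loads(json.dumps(updated_params))

print(json.dumps(round_tripped_params, indent=2))
```

Unlike this sketch, the cell above modifies the dictionary in place, which is fine for the notebook because the original values can always be reloaded from the HDF5 file.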

Now we can construct the ML model trainer using the serialized representation of
the ML model trainer parameters (note that this may take some time):

In [None]:
module_alias = emicroml.modelling.cbed.distortion.estimation
kwargs = {"serializable_rep": serializable_rep_of_ml_model_trainer_params}
ml_model_trainer = module_alias.MLModelTrainer.de_pre_serialize(**kwargs)

Apart from parameters specifying paths, this ML model trainer has the same
parameters as those used to train the ML model referenced in the page [Training
machine learning
models](https://mrfitzpa.github.io/emicroml/examples/modelling/cbed/distortion/estimation/train_ml_model_set.html).

From the ML model training summary output data file, we can also extract the
number of training and validation mini-batches processed per epoch:

In [None]:
for phase in phases:
    hdf5_dataset_path = "num_{}_mini_batches_per_epoch".format(phase)
    
    kwargs = {"filename": path_to_ml_model_training_summary_output_data,
              "path_in_file": hdf5_dataset_path}
    hdf5_dataset_id = h5pywrappers.obj.ID(**kwargs)

    kwargs = {"dataset_id": hdf5_dataset_id, "read_only": True}
    num_mini_batches_per_epoch = h5pywrappers.dataset.load(**kwargs)[()]

    unformatted_msg = "# {} mini-batches per epoch: {}"
    msg = unformatted_msg.format(phase, num_mini_batches_per_epoch)
    print(msg)

We can also extract the learning rate schedules used:

In [None]:
kwargs = {"filename": path_to_ml_model_training_summary_output_data,
          "path_in_file": "/lr_schedules/lr_schedule_0"}
hdf5_dataset_id = h5pywrappers.obj.ID(**kwargs)

kwargs = {"dataset_id": hdf5_dataset_id, "read_only": True}
lr_schedule = h5pywrappers.dataset.load(**kwargs)[()]



kwargs = {"obj_id": hdf5_dataset_id, "attr_name": "dim_0"}
attr_id = h5pywrappers.attr.ID(**kwargs)

kwargs = {"attr_id": attr_id}
dim_0_of_lr_schedule = h5pywrappers.attr.load(**kwargs)



y = lr_schedule
x = np.arange(y.size).astype("float")
if dim_0_of_lr_schedule == "training mini batch instance idx":
    kwargs = {"filename": path_to_ml_model_training_summary_output_data,
              "path_in_file": "/num_training_mini_batches_per_epoch"}
    hdf5_dataset_id = h5pywrappers.obj.ID(**kwargs)

    kwargs = {"dataset_id": hdf5_dataset_id, "read_only": True}
    num_mini_batches_per_epoch = h5pywrappers.dataset.load(**kwargs)[()]

    x /= num_mini_batches_per_epoch.astype("float")



fig, ax = plt.subplots()

ax.plot(x, y)

axis_label_font_size = 15
ax.set_xlabel("epoch", 
              fontsize=axis_label_font_size)
ax.set_ylabel("learning rate (dimensionless)", 
              fontsize=axis_label_font_size)

ax.set_yscale('log')

for spatial_dim in ("x", "y"):
    major_tick_width = 1.5
    major_tick_length = 8
    minor_tick_width = major_tick_width
    minor_tick_length = major_tick_length//2
    tick_label_size = 15
    linewidth = major_tick_width
        
    kwargs = {"axis": spatial_dim,
              "which": "major",
              "direction": "in",
              "left": True,
              "right": True, 
              "width": major_tick_width, 
              "length": major_tick_length, 
              "labelsize": tick_label_size}
    ax.tick_params(**kwargs)

    kwargs["which"] = "minor"
    kwargs["width"] = minor_tick_width
    kwargs["length"] = minor_tick_length
    ax.tick_params(**kwargs)

    for side in ['top','bottom','left','right']:
        ax.spines[side].set_linewidth(linewidth)

plt.tight_layout()
plt.show()
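
In the cell above, when the learning rate schedule is indexed by training mini-batch, the horizontal axis is rescaled so that it is expressed in epochs. The sketch below isolates that rescaling in plain Python. The number of mini-batches per epoch and the step-decay schedule are hypothetical values chosen for illustration.

```python
# Hypothetical values for illustration only.
num_mini_batches_per_epoch = 4
num_mini_batches = 10

# Hypothetical schedule: halve the learning rate at the start of each epoch.
lr_schedule = [1e-3 * (0.5 ** (idx // num_mini_batches_per_epoch))
               for idx in range(num_mini_batches)]

# Convert each mini-batch index to a (fractional) epoch.
epochs = [idx / num_mini_batches_per_epoch
          for idx in range(len(lr_schedule))]

for epoch, lr in zip(epochs, lr_schedule):
    print("epoch {:.2f}: learning rate {:.2e}".format(epoch, lr))
```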

We can also extract the metrics used to track ML model performance during
training. In the cell below, for both the training and validation phases, we
plot the first, second, and third quartiles per epoch of the end-point error
(EPE) of the predicted "adjusted" standard distortion field:

In [None]:
fig, ax = plt.subplots()



serializable_rep_of_ml_dataset_manager = \
    serializable_rep_of_ml_model_trainer_params["ml_dataset_manager"]
mini_batch_size = \
    serializable_rep_of_ml_dataset_manager["mini_batch_size"]



phase_to_color_map = {"training": "yellow", "validation": "green"}



for phase in phase_to_color_map:
    hdf5_dataset_path = ("/ml_data_instance_metrics"
                         "/{}/epes_of_adjusted_distortion_fields".format(phase))
    
    kwargs = {"filename": path_to_ml_model_training_summary_output_data,
              "path_in_file": hdf5_dataset_path}
    hdf5_dataset_id = h5pywrappers.obj.ID(**kwargs)

    kwargs = {"dataset_id": hdf5_dataset_id, "read_only": True}
    epes_of_adjust_distortion_fields = h5pywrappers.dataset.load(**kwargs)



    hdf5_dataset_path = "num_{}_mini_batches_per_epoch".format(phase)
    
    kwargs = {"filename": path_to_ml_model_training_summary_output_data,
              "path_in_file": hdf5_dataset_path}
    hdf5_dataset_id = h5pywrappers.obj.ID(**kwargs)

    kwargs = {"dataset_id": hdf5_dataset_id, "read_only": True}
    num_mini_batches_per_epoch = h5pywrappers.dataset.load(**kwargs)[()].item()

    

    num_ml_data_instances_in_ml_dataset = \
        (num_mini_batches_per_epoch * mini_batch_size)
    num_ml_data_instances_in_ml_dataset = \
        min(num_ml_data_instances_in_ml_dataset,
            epes_of_adjust_distortion_fields.size)



    start = 0
    stop = (epes_of_adjust_distortion_fields.size
            - (epes_of_adjust_distortion_fields.size
               % num_ml_data_instances_in_ml_dataset))
    single_dim_slice = slice(start, stop)

    epes_to_analyze = \
        epes_of_adjust_distortion_fields[single_dim_slice]
    epes_to_analyze_grouped_by_epoch = \
        epes_to_analyze.reshape(-1, num_ml_data_instances_in_ml_dataset)
    
    quantiles = np.quantile(epes_to_analyze_grouped_by_epoch, 
                            q=(0.25, 0.5, 0.75), 
                            axis=1)
    y = quantiles[1]
    y_err = quantiles[[0, 2], :]
    y_err[0] = y-y_err[0]
    y_err[1] = y_err[1]-y
    x = np.arange(y.size)



    kwargs = {"x": x,
              "y": y,
              "yerr": y_err,
              "label": phase,
              "marker": "o",
              "markersize": 3,
              "color": phase_to_color_map[phase],
              "mfc": phase_to_color_map[phase],
              "ecolor": (phase_to_color_map[phase], 0.3)}
    ax.errorbar(**kwargs)



legend_label_font_size = axis_label_font_size
ax.legend(loc="upper right", 
          fontsize=legend_label_font_size)

ax.set_xlabel("epoch", 
              fontsize=axis_label_font_size)
ax.set_ylabel("EPE of adjusted distortion\nfield (image width)", 
              fontsize=axis_label_font_size)

ax.set_yscale('log')

for spatial_dim in ("x", "y"):        
    kwargs = {"axis": spatial_dim,
              "which": "major",
              "direction": "in",
              "left": True,
              "right": True, 
              "width": major_tick_width, 
              "length": major_tick_length, 
              "labelsize": tick_label_size}
    ax.tick_params(**kwargs)

    kwargs["which"] = "minor"
    kwargs["width"] = minor_tick_width
    kwargs["length"] = minor_tick_length
    ax.tick_params(**kwargs)

    for side in ['top','bottom','left','right']:
        ax.spines[side].set_linewidth(linewidth)

plt.tight_layout()
plt.show()

In the plot directly above, for each phase, for each epoch, the vertical line
indicates the interquartile range, and the circle/dot indicates the median. The
metric plotted directly above is described in detail in the summary
documentation of the class
[emicroml.modelling.cbed.distortion.estimation.MLModelTrainer](https://mrfitzpa.github.io/emicroml/_autosummary/emicroml.modelling.cbed.distortion.estimation.MLModelTrainer.html).
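
The grouping and quartile calculation used to produce the plot can be sketched on a small example. The sketch below uses ``statistics.quantiles`` from the Python standard library rather than ``np.quantile``; with ``method="inclusive"`` it uses linear interpolation between order statistics, which we believe matches the default behaviour of ``np.quantile``. The metric values and the number of ML data instances per epoch are hypothetical.

```python
import statistics

# Hypothetical flat record of a per-instance metric, stored epoch after epoch.
num_ml_data_instances_per_epoch = 4
metric = [8.0, 6.0, 4.0, 2.0,   # epoch 0
          4.0, 3.0, 2.0, 1.0,   # epoch 1
          0.5, 0.4]             # epoch 2 (incomplete)

# Discard the trailing incomplete epoch, then group the values by epoch.
stop = len(metric) - (len(metric) % num_ml_data_instances_per_epoch)
metric_grouped_by_epoch = [metric[i:i + num_ml_data_instances_per_epoch]
                           for i in range(0, stop,
                                          num_ml_data_instances_per_epoch)]

medians = []
lower_errs = []
upper_errs = []
for metric_of_epoch in metric_grouped_by_epoch:
    # First, second, and third quartiles of the metric for this epoch.
    q1, q2, q3 = statistics.quantiles(metric_of_epoch,
                                      n=4,
                                      method="inclusive")
    medians.append(q2)
    # Asymmetric error bars measured relative to the median.
    lower_errs.append(q2 - q1)
    upper_errs.append(q3 - q2)

print(medians, lower_errs, upper_errs)
```

The lower and upper error bar lengths are measured from the median, which is the form that ``ax.errorbar`` expects for asymmetric error bars.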

Similarly to performance metrics, we can also extract the time-series of the
total mini-batch loss, which we plot below:

In [None]:
fig, ax = plt.subplots()



for phase in phase_to_color_map:
    hdf5_dataset_path = "num_{}_mini_batches_per_epoch".format(phase)
    
    kwargs = {"filename": path_to_ml_model_training_summary_output_data,
              "path_in_file": hdf5_dataset_path}
    hdf5_dataset_id = h5pywrappers.obj.ID(**kwargs)

    kwargs = {"dataset_id": hdf5_dataset_id, "read_only": True}
    num_mini_batches_per_epoch = h5pywrappers.dataset.load(**kwargs)[()]



    hdf5_dataset_path = ("/mini_batch_losses/{}/total".format(phase))
    
    kwargs = {"filename": path_to_ml_model_training_summary_output_data,
              "path_in_file": hdf5_dataset_path}
    hdf5_dataset_id = h5pywrappers.obj.ID(**kwargs)

    kwargs = {"dataset_id": hdf5_dataset_id, "read_only": True}
    total_mini_batch_losses = h5pywrappers.dataset.load(**kwargs)[()]


    
    start = 0
    stop = (total_mini_batch_losses.size
            - (total_mini_batch_losses.size % num_mini_batches_per_epoch))
    single_dim_slice = slice(start, stop)

    losses_to_analyze = \
        total_mini_batch_losses[single_dim_slice]
    losses_to_analyze_grouped_by_epoch = \
        losses_to_analyze.reshape(-1, num_mini_batches_per_epoch)
    
    quantiles = np.quantile(losses_to_analyze_grouped_by_epoch, 
                            q=(0.25, 0.5, 0.75), 
                            axis=1)
    y = quantiles[1]
    y_err = quantiles[[0, 2], :]
    y_err[0] = y-y_err[0]
    y_err[1] = y_err[1]-y
    x = np.arange(y.size)



    kwargs = {"x": x,
              "y": y,
              "yerr": y_err,
              "label": phase,
              "marker": "o",
              "markersize": 3,
              "color": phase_to_color_map[phase],
              "mfc": phase_to_color_map[phase],
              "ecolor": (phase_to_color_map[phase], 0.3)}
    ax.errorbar(**kwargs)



legend_label_font_size = axis_label_font_size
ax.legend(loc="upper right", 
          fontsize=legend_label_font_size)

ax.set_xlabel("epoch", 
              fontsize=axis_label_font_size)
ax.set_ylabel("total mini-batch loss (image width)", 
              fontsize=axis_label_font_size)

ax.set_yscale('log')

for spatial_dim in ("x", "y"):        
    kwargs = {"axis": spatial_dim,
              "which": "major",
              "direction": "in",
              "left": True,
              "right": True, 
              "width": major_tick_width, 
              "length": major_tick_length, 
              "labelsize": tick_label_size}
    ax.tick_params(**kwargs)

    kwargs["which"] = "minor"
    kwargs["width"] = minor_tick_width
    kwargs["length"] = minor_tick_length
    ax.tick_params(**kwargs)

    for side in ['top','bottom','left','right']:
        ax.spines[side].set_linewidth(linewidth)

plt.tight_layout()
plt.show()

The total mini-batch loss plotted directly above is also described in detail in
the summary documentation of the class
[emicroml.modelling.cbed.distortion.estimation.MLModelTrainer](https://mrfitzpa.github.io/emicroml/_autosummary/emicroml.modelling.cbed.distortion.estimation.MLModelTrainer.html).

Let's circle back to the EPE of the predicted "adjusted" standard distortion
field, where this time we plot its cumulative distribution function (CDF) at the
end of training:

In [None]:
fig, ax = plt.subplots()



phases = ("validation", "training")



for phase in phases:
    hdf5_dataset_path = ("/ml_data_instance_metrics"
                         "/{}/epes_of_adjusted_distortion_fields".format(phase))
    
    kwargs = {"filename": path_to_ml_model_training_summary_output_data,
              "path_in_file": hdf5_dataset_path}
    hdf5_dataset_id = h5pywrappers.obj.ID(**kwargs)

    kwargs = {"dataset_id": hdf5_dataset_id, "read_only": True}
    epes_of_adjust_distortion_fields = h5pywrappers.dataset.load(**kwargs)



    hdf5_dataset_path = "num_{}_mini_batches_per_epoch".format(phase)
    
    kwargs = {"filename": path_to_ml_model_training_summary_output_data,
              "path_in_file": hdf5_dataset_path}
    hdf5_dataset_id = h5pywrappers.obj.ID(**kwargs)

    kwargs = {"dataset_id": hdf5_dataset_id, "read_only": True}
    num_mini_batches_per_epoch = h5pywrappers.dataset.load(**kwargs)[()].item()

    

    num_ml_data_instances_in_ml_dataset = \
        (num_mini_batches_per_epoch * mini_batch_size)
    num_ml_data_instances_in_ml_dataset = \
        min(num_ml_data_instances_in_ml_dataset,
            epes_of_adjust_distortion_fields.size)



    start = -num_ml_data_instances_in_ml_dataset
    stop = None
    single_dim_slice = slice(start, stop)
    multi_dim_slice = (single_dim_slice,)



    hdf5_dataset_path = ("/ml_data_instance_metrics"
                         "/{}/epes_of_adjusted_distortion_fields".format(phase))
    
    kwargs = {"filename": path_to_ml_model_training_summary_output_data,
              "path_in_file": hdf5_dataset_path}
    hdf5_dataset_id = h5pywrappers.obj.ID(**kwargs)

    kwargs = {"dataset_id": hdf5_dataset_id,
              "multi_dim_slice": multi_dim_slice}
    hdf5_datasubset_id = h5pywrappers.datasubset.ID(**kwargs)

    kwargs = {"datasubset_id": hdf5_datasubset_id}
    x = h5pywrappers.datasubset.load(**kwargs)


    
    kwargs = {"x": x,
              "bins": np.linspace(0, 0.04, 100),
              "histtype": "bar", 
              "ec": "black", 
              "cumulative": True, 
              "density": True, 
              "log": False,
              "alpha": 1,
              "color": phase_to_color_map[phase],
              "label": phase}
    ax.hist(**kwargs)



ax.legend(loc="lower right", 
          fontsize=legend_label_font_size)

ax.set_xlabel("EPE of adjusted distortion field (image width)", 
              fontsize=axis_label_font_size)
ax.set_ylabel("portion of images", 
              fontsize=axis_label_font_size)

ax.yaxis.set_major_formatter(matplotlib.ticker.PercentFormatter(1))

for spatial_dim in ("x", "y"):        
    kwargs = {"axis": spatial_dim,
              "which": "major",
              "direction": "in",
              "left": True,
              "right": True, 
              "width": major_tick_width, 
              "length": major_tick_length, 
              "labelsize": tick_label_size}
    ax.tick_params(**kwargs)

    kwargs["which"] = "minor"
    kwargs["width"] = minor_tick_width
    kwargs["length"] = minor_tick_length
    ax.tick_params(**kwargs)

    for side in ['top','bottom','left','right']:
        ax.spines[side].set_linewidth(linewidth)

plt.tight_layout()
plt.show()
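
For reference, the quantity approximated by the cumulative histogram above is the empirical CDF, i.e. the fraction of samples that are less than or equal to a given threshold. Below is a minimal sketch in plain Python, using hypothetical EPE values and a hypothetical helper function. Note that, to the best of our understanding of ``matplotlib``, a cumulative histogram with ``density=True`` is normalized over the samples that fall within the bin range, so any EPEs above the last bin edge would not be counted in the plot, whereas the helper below always divides by the total number of samples.

```python
import bisect


# Hypothetical helper for illustration; not part of ``emicroml``.
def empirical_cdf(samples, thresholds):
    # Fraction of samples less than or equal to each threshold.
    ordered_samples = sorted(samples)
    num_samples = len(ordered_samples)
    return [bisect.bisect_right(ordered_samples, threshold) / num_samples
            for threshold in thresholds]


# Hypothetical EPEs, in units of the image width.
epes = [0.004, 0.010, 0.002, 0.030, 0.008]
thresholds = [0.005, 0.01, 0.04]

cdf = empirical_cdf(epes, thresholds)

for threshold, fraction in zip(thresholds, cdf):
    print("EPE <= {}: {:.0%} of images".format(threshold, fraction))
```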

## Loading and using a trained ML model ##

Let's load a trained ML model (this may take some time):

In [None]:
path_to_ml_model_state_dicts = path_to_data_dir + "/ml_models/ml_model_1"
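# Use a raw string so that ``\.`` is passed to the regex engine unchanged, and
# require at least one digit in the LR step index.
pattern = r"ml_model_at_lr_step_[0-9]+\.pth"
# Compare LR step indices as integers: comparing them as strings would rank
# "9" above "10".
largest_lr_step_idx = max([int(name.split("_")[-1].split(".")[0])
                           for name in os.listdir(path_to_ml_model_state_dicts)
                           if re.fullmatch(pattern, name)])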

ml_model_state_dict_filename = \
    (path_to_ml_model_state_dicts
     + "/ml_model_at_lr_step_{}.pth".format(largest_lr_step_idx))



module_alias = emicroml.modelling.cbed.distortion.estimation
kwargs = {"ml_model_state_dict_filename": ml_model_state_dict_filename,
          "device_name": None}  # Default to CUDA device if available.
ml_model = module_alias.load_ml_model_from_file(**kwargs)

_ = ml_model.eval()
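
When selecting the file with the largest LR step index, it matters whether the indices are compared as strings or as integers. The sketch below illustrates the difference using a hypothetical directory listing; the filenames are made up for illustration, and only the filename pattern follows the cell above.

```python
import re

# Hypothetical directory listing.
names = ["ml_model_at_lr_step_9.pth",
         "ml_model_at_lr_step_10.pth",
         "ml_model_at_lr_step_2.pth",
         "notes.txt"]

# The capturing group extracts the LR step index from each matching filename.
pattern = r"ml_model_at_lr_step_([0-9]+)\.pth"
matches = [re.fullmatch(pattern, name) for name in names]
lr_step_idx_strs = [match.group(1) for match in matches if match is not None]

# String comparison is lexicographic, so "9" ranks above "10".
largest_as_str = max(lr_step_idx_strs)

# Integer comparison gives the intended result.
largest_as_int = max(int(idx_str) for idx_str in lr_step_idx_strs)

print(largest_as_str, largest_as_int)
```

For this reason, the LR step indices should be converted to integers before taking the maximum whenever more than nine state dictionaries may have been saved.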

Alternatively, we can load the trained ML model as follows:

In [None]:
kwargs = {"f": ml_model_state_dict_filename,
          "map_location": torch.device('cpu'),
          "weights_only": True}
ml_model_state_dict = torch.load(**kwargs)


module_alias = emicroml.modelling.cbed.distortion.estimation
kwargs = {"ml_model_state_dict": ml_model_state_dict,
          "device_name": None}
ml_model = module_alias.load_ml_model_from_state_dict(**kwargs)

_ = ml_model.eval()

With the ML model loaded, let's try estimating the distortion field in a
simulated CBED pattern. First, we need to load a simulated CBED pattern:

In [None]:
path_to_ml_dataset

In [None]:
path_to_ml_dataset = (path_to_data_dir
                      + "/ml_datasets/ml_dataset_for_validation.h5")

module_alias = emicroml.modelling.cbed.distortion.estimation
kwargs = {"path_to_ml_dataset": path_to_ml_dataset, 
          "entire_ml_dataset_is_to_be_cached": False, 
          "ml_data_values_are_to_be_checked": False}
ml_dataset = module_alias.MLDataset(**kwargs)



cbed_pattern_idx = 0

kwargs = \
    {"single_dim_slice": slice(cbed_pattern_idx, cbed_pattern_idx+1), 
     "sampling_grid_dims_in_pixels": sampling_grid_dims_in_pixels}
ml_data_instances_as_signals = \
    ml_dataset.get_ml_data_instances_as_signals(**kwargs)

distorted_cbed_pattern_signal = ml_data_instances_as_signals[0]

Let's plot the distorted CBED pattern:

In [None]:
kwargs = {"axes_off": True, 
          "scalebar": False, 
          "colorbar": False, 
          "gamma": 0.2,
          "cmap": "plasma", 
          "title": ""}
distorted_cbed_pattern_signal.plot(**kwargs)

There are essentially two ways to estimate the distortion field. The first way
is as follows:

In [None]:
distorted_cbed_pattern_image = distorted_cbed_pattern_signal.data[0]
distorted_cbed_pattern_images = distorted_cbed_pattern_image[None, :, :]
ml_inputs = {"cbed_pattern_images": distorted_cbed_pattern_images}

kwargs = {"ml_inputs": ml_inputs,
          "unnormalize_normalizable_elems_of_ml_predictions": True}
ml_predictions = ml_model.make_predictions(**kwargs)



module_alias = emicroml.modelling.cbed.distortion.estimation
kwargs = {"ml_data_dict": ml_predictions,
          "sampling_grid_dims_in_pixels": sampling_grid_dims_in_pixels}
distortion_models = module_alias.ml_data_dict_to_distortion_models(**kwargs)

distortion_model = distortion_models[0]

And the second way, which is more direct, is as follows:

In [None]:
kwargs = {"cbed_pattern_images": distorted_cbed_pattern_images,
          "sampling_grid_dims_in_pixels": sampling_grid_dims_in_pixels}
distortion_models = ml_model.predict_distortion_models(**kwargs)

distortion_model = distortion_models[0]

Note that each input distorted CBED pattern must have image dimensions, in units
of pixels, equal to
``2*(ml_model.core_attrs["num_pixels_across_each_cbed_pattern"],)``. This is
because a given ML model is trained for images of fixed dimensions, in units of
pixels.
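
As a small illustration of the expression above, ``2*(n,)`` is Python tuple repetition and evaluates to ``(n, n)``. The sketch below uses it to check the shape of a stack of images before passing the stack to an ML model. The helper functions and the value ``512`` are hypothetical and are not part of ``emicroml``.

```python
# Hypothetical helpers for illustration; these are not part of ``emicroml``.
def expected_image_shape(num_pixels_across_each_cbed_pattern):
    # Tuple repetition: ``2*(n,)`` evaluates to ``(n, n)``.
    return 2 * (num_pixels_across_each_cbed_pattern,)


def check_image_stack_shape(image_stack_shape,
                            num_pixels_across_each_cbed_pattern):
    # The last two dimensions of the stack are the image dimensions.
    expected = expected_image_shape(num_pixels_across_each_cbed_pattern)
    actual = tuple(image_stack_shape[-2:])
    if actual != expected:
        unformatted_msg = "Expected images of shape {}, got {}."
        raise ValueError(unformatted_msg.format(expected, actual))
    return actual


# Hypothetical number of pixels across each CBED pattern.
num_pixels = 512

good_shape = check_image_stack_shape((1, 512, 512), num_pixels)

try:
    check_image_stack_shape((1, 256, 256), num_pixels)
    bad_shape_rejected = False
except ValueError:
    bad_shape_rejected = True

print(good_shape, bad_shape_rejected)
```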

``distortion_model`` is an instance of the class
[distoptica.DistortionModel](https://mrfitzpa.github.io/distoptica/_autosummary/distoptica.DistortionModel.html). This
object stores the predicted distortion field, and can be used to perform
distortion correction. Let's visualize the predicted distortion field:

In [None]:
slice_step = 16



quiver_kwargs = {"angles": "uv",
                 "pivot": "middle",
                 "scale_units": "width"}



attr_name = "sampling_grid"
sampling_grid = getattr(distortion_model, attr_name)
sampling_grid = (sampling_grid[0].numpy(), sampling_grid[1].numpy())

X = sampling_grid[0][::slice_step, ::slice_step]
Y = sampling_grid[1][::slice_step, ::slice_step]



fig, ax = plt.subplots()

attr_name = "flow_field_of_coord_transform"
flow_field = getattr(distortion_model, attr_name)
flow_field = (flow_field[0].numpy(), flow_field[1].numpy())

U = flow_field[0][::slice_step, ::slice_step]
V = flow_field[1][::slice_step, ::slice_step]

kwargs = quiver_kwargs
ax.quiver(X, Y, U, V, **kwargs)

title_font_size = axis_label_font_size

ax.set_title("Flow Field Of Coordinate Transformation", 
             fontsize=title_font_size)
ax.set_xlabel("fractional horizontal coordinate", 
              fontsize=axis_label_font_size)
ax.set_ylabel("fractional vertical coordinate", 
              fontsize=axis_label_font_size)

for spatial_dim in ("x", "y"):
    kwargs = {"axis": spatial_dim,
              "which": "major",
              "direction": "out",
              "left": True,
              "right": True, 
              "width": major_tick_width, 
              "length": major_tick_length, 
              "labelsize": tick_label_size}
    ax.tick_params(**kwargs)

    kwargs["which"] = "minor"
    kwargs["width"] = minor_tick_width
    kwargs["length"] = minor_tick_length
    ax.tick_params(**kwargs)

# Hide all ticks and tick labels, then thicken the axes spines. These calls do
# not depend on ``spatial_dim``, so they only need to run once.
ax.tick_params(axis="x", which="both", bottom=False, top=False,
               labelbottom=False)
ax.tick_params(axis="y", which="both", left=False, right=False,
               labelleft=False)

for side in ("top", "bottom", "left", "right"):
    ax.spines[side].set_linewidth(major_tick_width)

ax.set_aspect("equal")
plt.tight_layout()
plt.show()
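
The strided slicing used in the cell above keeps every ``slice_step``-th sample
along each axis so that the quiver plot is not overcrowded. The following is a
minimal sketch of that subsampling on a synthetic field, assuming a hypothetical
64-by-64 grid and a made-up radial flow field rather than a predicted one:

```python
import numpy as np

# Hypothetical grid size and step, chosen only for illustration.
num_pixels = 64
slice_step = 16

# Fractional pixel-centre coordinates in (0, 1); a synthetic stand-in for a
# sampling grid.
coords = (np.arange(num_pixels) + 0.5) / num_pixels
X_full, Y_full = np.meshgrid(coords, coords)

# Made-up radial displacement field about the image centre.
U_full = 0.1 * (X_full - 0.5)
V_full = 0.1 * (Y_full - 0.5)

# Keep every ``slice_step``-th row and column.
X = X_full[::slice_step, ::slice_step]
U = U_full[::slice_step, ::slice_step]

print(X_full.shape, "->", X.shape)  # (64, 64) -> (4, 4)
```

A smaller ``slice_step`` draws more arrows and shows finer detail, at the cost
of a busier figure.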

Let's now use ``distortion_model`` to correct the distortion in the original
distorted CBED pattern, according to the predicted distortion field:

In [None]:
kwargs = \
    {"distorted_images": distorted_cbed_pattern_image[None, None, :, :]}
undistorted_then_resampled_images = \
    distortion_model.undistort_then_resample_images(**kwargs)

undistorted_cbed_pattern_image = undistorted_then_resampled_images[0, 0]
undistorted_cbed_pattern_images = undistorted_cbed_pattern_image[None, :, :]



ml_data_dict = {"cbed_pattern_images": undistorted_cbed_pattern_images}

module_alias = \
    emicroml.modelling.cbed.distortion.estimation
kwargs = \
    {"ml_data_dict": ml_data_dict,
     "sampling_grid_dims_in_pixels": sampling_grid_dims_in_pixels}
undistorted_cbed_pattern_signals = \
    module_alias.ml_data_dict_to_signals(**kwargs)

undistorted_cbed_pattern_signal = undistorted_cbed_pattern_signals[0]

kwargs = {"axes_off": True, 
          "scalebar": False, 
          "colorbar": False, 
          "gamma": 0.2,
          "cmap": "plasma", 
          "title": ""}
undistorted_cbed_pattern_signal.inav[0].plot(**kwargs)

Note how the CBED disks are now much more circular, which indicates
qualitatively that the ML model did a reasonable job predicting the distortion
field.

Another useful instance attribute in this context is
[distoptica.DistortionModel.out_of_bounds_map_of_undistorted_then_resampled_images](https://mrfitzpa.github.io/distoptica/_autosummary/distoptica.DistortionModel.html#distoptica.DistortionModel.out_of_bounds_map_of_undistorted_then_resampled_images):

In [None]:
out_of_bounds_map_of_undistorted_then_resampled_images = \
    distortion_model.out_of_bounds_map_of_undistorted_then_resampled_images

fig, ax = plt.subplots()

ax.imshow(out_of_bounds_map_of_undistorted_then_resampled_images)

ax.axis("off")

plt.tight_layout()
plt.show()
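
One simple way to summarize such a map is the fraction of flagged pixels. The
sketch below uses a synthetic boolean array in place of the real attribute. It
assumes, without confirmation from this notebook, that ``True`` marks pixels of
the undistorted then resampled image that could not be filled from within the
bounds of the distorted image; consult the ``distoptica`` documentation for the
exact definition.

```python
import numpy as np

# Synthetic stand-in for an out-of-bounds map: an 8x8 image whose outermost
# one-pixel border is flagged.
out_of_bounds_map = np.ones((8, 8), dtype=bool)
out_of_bounds_map[1:-1, 1:-1] = False

# Fraction of pixels flagged as out of bounds.
out_of_bounds_fraction = out_of_bounds_map.mean()

# Mask the flagged pixels of a (synthetic) image so that they can be excluded
# from any later analysis.
image = np.arange(64, dtype=float).reshape(8, 8)
masked_image = np.ma.masked_array(image, mask=out_of_bounds_map)

print(out_of_bounds_fraction)  # 0.4375
print(masked_image.count())    # 36
```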

## A note on ``fancytypes`` ##

The variables ``ml_dataset``, ``ml_model_trainer``, and ``distortion_model``,
used throughout the notebook, reference instances of subclasses of the
[fancytypes.PreSerializableAndUpdatable](https://mrfitzpa.github.io/fancytypes/_autosummary/fancytypes.PreSerializableAndUpdatable.html)
class. One of the nice features of this class is that an instance of any of its
subclasses can be serialized straightforwardly and later reconstructed from its
serialized representation. For example, let's serialize ``ml_dataset``:

In [None]:
serialized_rep_of_ml_dataset = ml_dataset.dumps()
serialized_rep_of_ml_dataset

Now let's reconstruct ``ml_dataset`` from its serialized representation:

In [None]:
module_alias = emicroml.modelling.cbed.distortion.estimation
kwargs = {"serialized_rep": serialized_rep_of_ml_dataset}
ml_dataset = module_alias.MLDataset.loads(**kwargs)
ml_dataset
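
The serialize-then-reconstruct pattern shown above can be illustrated with a toy
class. This is only a sketch of the general idea using the standard ``json``
module: the class ``ToyParams`` and its internals are hypothetical and do not
reflect how ``fancytypes`` is implemented. Only the method names ``dumps`` and
``loads`` and the keyword ``serialized_rep`` mirror the calls used above.

```python
import json


class ToyParams:
    """Hypothetical class with a JSON-based dumps/loads round trip."""

    def __init__(self, num_pixels, label):
        self.core_attrs = {"num_pixels": num_pixels, "label": label}

    def dumps(self):
        # Serialize the construction parameters to a JSON string.
        return json.dumps(self.core_attrs)

    @classmethod
    def loads(cls, serialized_rep):
        # Rebuild an instance from the serialized construction parameters.
        return cls(**json.loads(serialized_rep))


original = ToyParams(num_pixels=512, label="demo")
serialized_rep = original.dumps()
reconstructed = ToyParams.loads(serialized_rep=serialized_rep)

# The round trip preserves the construction parameters.
assert reconstructed.core_attrs == original.core_attrs
```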

The
[fancytypes.PreSerializableAndUpdatable](https://mrfitzpa.github.io/fancytypes/_autosummary/fancytypes.PreSerializableAndUpdatable.html)
class has other nice features, which you can read about in more detail in its
documentation.