# SPED data processing

1. [Load data](#Load-data)
2. [Template matching](#Template-matching)
    1. [Build template library](#Build-the-template-library)
    2. [Indexing](#Indexing)
3. [NMF](#NMF)
4. [Clustering](#Clustering)

Some common dependencies

In [1]:
# You might have tk installed instead of qt
%matplotlib qt
import math
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import pyxem as pxm

import diffpy.structure

from transforms3d.axangles import axangle2mat
from transforms3d.euler import axangle2euler
from transforms3d.euler import euler2mat
from transforms3d.euler import mat2euler

#import warnings
# Silence some future warnings and user warnings (float64 -> uint8)
# in skimage when calling remove_background with h-dome (below)
# Should really be fixed elsewhere.
#warnings.simplefilter(action='ignore', category=FutureWarning)
#warnings.simplefilter(action='ignore', category=UserWarning)

## Load data

Load the SPED dataset. The file is lazy-loaded and then cut. This ensures that only required areas are loaded from disk to memory.

The data type is changed to float and some metadata is set. The call to `pxm.ElectronDiffraction` converts the lazy hyperspy signal to a fully loaded pyxem object which gives access to the pyxem tools. The metadata from the file has to be copied manually. The constructor probably should have done so automatically, but it does not.

In [26]:
data_dir = 'D:/Dokumenter/MTNANO/Prosjektoppgave/SPED_data_GaAs_NW/'
in_file = data_dir + 'gen/Julie_180510_SCN45_FIB_a_pyxem_sample.hdf5'
reciprocal_angstrom_per_pixel = 0.032  # Reciprocal calibration

dp = pxm.load(in_file, lazy=True)
dp = dp.inav[95:100, 30:75]

# The background removal and affine transform change the type without
# respecting the loaded precision. We do it ourselves to be explicit.
if dp.data.dtype != 'float64':
    dp.change_dtype('float64')
    
# Convert to a pyxem ElectronDiffraction, conserve the metadata and add some more
dp_metadata = dp.metadata
dp = pxm.ElectronDiffraction(dp)
dp.data *= 1 / dp.data.max()
dp.metadata = dp_metadata
dp.set_diffraction_calibration(reciprocal_angstrom_per_pixel)

## Preprocessing

In [27]:
scale_x = 0.9954344818525674
scale_y = 1.0314371455342144
offset_x = 0.6312060246018557
offset_y = -0.3516223696556279
sigma_min = 2
sigma_max = 8

dp.apply_affine_transformation(np.array([
    [scale_x, 0, offset_x],
    [0, scale_y, offset_y],
    [0, 0, 1]
    ]))
dp = dp.remove_background('gaussian_difference', sigma_min=sigma_min, sigma_max=sigma_max)
dp.data *= 1 / dp.data.max()
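
The matrix passed to `apply_affine_transformation` above is a 3×3 homogeneous transform, with the scale factors on the diagonal and the offsets in the last column. As a rough illustration of what such a matrix does to a coordinate, here is a NumPy-only sketch. The helper `transform_points` is hypothetical, the values are rounded from the cell above, and whether pyxem applies the matrix as a forward or an inverse mapping is not covered here.

```python
import numpy as np

def transform_points(matrix, points):
    """Apply a 3x3 homogeneous transform to an (N, 2) array of (x, y) points."""
    points = np.asarray(points, dtype=float)
    # Append a column of ones so that the offsets can be applied by matrix multiplication
    homogeneous = np.column_stack([points, np.ones(len(points))])
    transformed = homogeneous @ matrix.T
    # Divide by the last coordinate (1 for a pure affine transform)
    return transformed[:, :2] / transformed[:, 2:3]

# Illustrative values, rounded from the preprocessing cell
scale_x, scale_y = 0.995, 1.031
offset_x, offset_y = 0.631, -0.352
matrix = np.array([
    [scale_x, 0, offset_x],
    [0, scale_y, offset_y],
    [0, 0, 1]])

corners = np.array([[0, 0], [100, 0], [0, 100]])
moved = transform_points(matrix, corners)
```

The origin is moved by the offsets only, while points further from the origin are also stretched or compressed by the scale factors.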

## Template matching
Template matching generates a database of simulated diffraction patterns and then compares every simulated pattern to each experimental diffraction pattern to find the best match.

In [None]:
from pyxem.generators.indexation_generator import IndexationGenerator
from pyxem.generators.structure_library_generator import StructureLibraryGenerator
from pyxem.libraries.diffraction_library import load_DiffractionLibrary

### Build the template library

Load structure files using `diffpy`.

In [None]:
structure_zb_file = r'D:\Dokumenter\MTNANO\Prosjektoppgave\Data\Gen\NN_test_data\GaAs_mp-2534_conventional_standard.cif'
structure_wz_file = r'D:\Dokumenter\MTNANO\Prosjektoppgave\Data\Gen\NN_test_data\GaAs_mp-8883_conventional_standard.cif'

structure_zb = diffpy.structure.loadStructure(structure_zb_file)
structure_wz = diffpy.structure.loadStructure(structure_wz_file)

In [None]:
rotation_list_resolution = np.deg2rad(1)
beam_energy_keV = 200
max_excitation_error = 1/6.84  # Ångström^{-1}, extent of relrods in reciprocal space. Inverse of specimen thickness is a starting point

phase_descriptions = [('ZB', structure_zb, 'cubic'),
                      ('WZ', structure_wz, 'hexagonal')]
phase_names = [phase[0] for phase in phase_descriptions]
structure_library_generator = StructureLibraryGenerator(phase_descriptions)

Load diffraction library from file on disk or create a new one. From disk:

In [None]:
diffraction_library_cache_filename = '../../Data/Runs/tmp/GaAs_cubic_hex_1deg.pickle'
diffraction_library = load_DiffractionLibrary(diffraction_library_cache_filename, safety=True)

Or generate it from a rotation list on a stereographic triangle:

In [None]:
inplane_rotations = [np.deg2rad((103, 173)), np.deg2rad((140,))]
structure_library = structure_library_generator.get_orientations_from_stereographic_triangle(
        inplane_rotations, rotation_list_resolution)
gen = pxm.DiffractionGenerator(beam_energy_keV, max_excitation_error=max_excitation_error)
library_generator = pxm.DiffractionLibraryGenerator(gen)
target_pattern_dimension_pixels = dp.axes_manager.signal_shape[0]
half_pattern_size = target_pattern_dimension_pixels // 2
reciprocal_radius = reciprocal_angstrom_per_pixel*(half_pattern_size - 1)
diffraction_library = library_generator.get_diffraction_library(
    structure_library,
    calibration=reciprocal_angstrom_per_pixel,
    reciprocal_radius=reciprocal_radius,
    half_shape=(half_pattern_size, half_pattern_size),
    with_direct_beam=False)

Optionally, save the library for later use.

In [None]:
diffraction_library.pickle_library(diffraction_library_cache_filename)

### Indexing

Given the `diffraction_library` defined above, the `IndexationGenerator` finds the correlation between all patterns in the library and each experimental pattern, and returns the `n_largest` matches with highest correlation.

In [None]:
indexer = IndexationGenerator(dp, diffraction_library)
indexation_results = indexer.correlate(n_largest=4, keys=phase_names)
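
To make the idea of correlating templates with a pattern concrete, here is a minimal NumPy sketch. It assumes that a template is a list of spot positions with simulated intensities, and it uses a simple normalised dot product as the score. pyxem's actual correlation measure may be defined differently, and the helpers `template_score` and `best_matches` are hypothetical names used only for this illustration.

```python
import numpy as np

def template_score(image, pixel_coords, intensities):
    """Score a template of (x, y) spots against an image.

    Sums the image intensity at the template spots, weighted by the simulated
    intensities, and divides by the norm of the template intensities.
    """
    coords = np.asarray(pixel_coords, dtype=int)
    weights = np.asarray(intensities, dtype=float)
    sampled = image[coords[:, 1], coords[:, 0]]  # index as [row (y), column (x)]
    return float(np.sum(sampled * weights) / np.linalg.norm(weights))

def best_matches(image, templates, n_largest=2):
    """Return (name, score) for the n_largest highest-scoring templates."""
    scores = [(name, template_score(image, coords, intens))
              for name, (coords, intens) in templates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:n_largest]

# Toy "experimental" pattern with two spots at (x, y) = (2, 3) and (6, 5)
image = np.zeros((8, 8))
image[3, 2] = 1.0
image[5, 6] = 0.5

# Toy template library: name -> (spot positions, spot intensities)
templates = {
    'matching': ([(2, 3), (6, 5)], [1.0, 0.5]),
    'half': ([(2, 3), (1, 1)], [1.0, 0.5]),
    'wrong': ([(0, 0), (7, 7)], [1.0, 1.0]),
}
ranked = best_matches(image, templates, n_largest=2)
```

Keeping several matches per pattern, rather than only the best one, makes it possible to compare the best score with the runners-up.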

pyxem exposes visualisations for the indexation results through a `CrystallographicMap`. Here, the phase map and orientation map are plotted along with their reliability maps. The orientation maps are not really usable directly, and should be exported to mtex for better plotting (see below).

In [None]:
crystal_map = indexation_results.get_crystallographic_map()

In [None]:
crystal_map.get_phase_map().plot()
crystal_map.get_reliability_map_phase().plot()

In [None]:
crystal_map.get_orientation_map().plot()
crystal_map.get_reliability_map_orientation().plot()

mtex gives much better orientation maps, and pyxem supports exporting the orientation data in a format that can be read by mtex.

In [None]:
crystal_map.save_mtex_map(r'..\..\Data\Runs\tmp\mtex_orientation_data.csv')

Let's look at the best match, one position at a time. The pyxem solution (`indexation_results.plot_best_matching_results_on_signal(dp, phase_names, diffraction_library)`) does not work for non-square datasets. This is a `Hyperspy` problem, see https://github.com/hyperspy/hyperspy/issues/2080. Instead, first get the matches and store them in `peaks`.

In [None]:
peaks = []
for indexation_result in indexation_results:
    single_match_result = indexation_result.data
    best_fit = single_match_result[np.argmax(single_match_result[:, 4])]
    phase_name = phase_names[int(best_fit[0])]
    library_entry = diffraction_library.get_library_entry(phase=phase_name, angle=(best_fit[1], best_fit[2], best_fit[3]))
    peaks.append((library_entry['pixel_coords'], library_entry['intensities'], [phase_name, *best_fit[1:4]], best_fit[4]))
peaks = np.array(peaks).reshape(dp.data.shape[0], dp.data.shape[1], 4)

Then, plot the image with the matched template on top, and print the phase and angles.

In [None]:
dp.set_diffraction_calibration(reciprocal_angstrom_per_pixel)
x = 0
y = 34
plt.figure('Best fit')
plt.cla()
plt.imshow(dp.inav[x, y])
plt.scatter(peaks[y, x, 0][:, 0], peaks[y, x, 0][:, 1], marker='x', c=np.log(1 + peaks[y, x, 1]), cmap='autumn_r')
print('Best fit:', peaks[y, x, 2], 'score:', peaks[y, x, 3])

## NMF

Non-negative matrix factorisation (NMF) factorises the dataset into `n_components` components that, hopefully, resemble physical diffraction patterns. This requires choosing the correct number of components. It might help to study the scree plot, but with long tails the cut-off might not be clear. The decomposition used to create the scree plot (SVD) is not the same as the one used in NMF, so NMF will not give the same separation between noise and signal. If too many components are given, NMF often splits the dataset by other differences, such as bending or strain. Too few components might combine similar areas.

In [4]:
dp.decomposition(normalize_poissonian_noise=True, algorithm='svd')
dp.plot_explained_variance_ratio()
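
For reference, the quantity shown in a scree plot can be sketched with plain NumPy on synthetic data. This is only an illustration of the principle: it does not reproduce the Poissonian noise normalisation or any other preprocessing that hyperspy may apply, and all names and sizes below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: 200 "patterns" of 50 pixels, built from 3 components plus weak noise
n_patterns, n_pixels, n_true = 200, 50, 3
loadings_true = rng.random((n_patterns, n_true))
factors_true = rng.random((n_true, n_pixels))
noise = 0.01 * rng.standard_normal((n_patterns, n_pixels))
data = loadings_true @ factors_true + noise

# Singular values of the data matrix, returned in descending order
singular_values = np.linalg.svd(data, compute_uv=False)

# Explained variance ratio: squared singular values normalised to sum to 1
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_variance_ratio)
```

In this noise-dominated-tail case the ratio drops sharply after the third component, which is the kind of cut-off to look for. Real data often has a much longer tail.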

Set the number of components and do the decomposition. `normalize_poissonian_noise=True` seems to give better results when there is noise present, but it might be worth testing without it.

In [5]:
n_components = 3

In [6]:
dp.decomposition(
        normalize_poissonian_noise=True,
        algorithm='nmf',
        output_dimension=n_components)

Hyperspy conveniently provides a function to visualise the results.

In [7]:
dp.plot_decomposition_results()

Optionally, we can get the decomposition data directly if we want to do some further processing on it. Here, we also normalise each loading to have a maximum value of 1 to remove an ambiguity in the decomposition.

In [8]:
# Read the results
factors = dp.get_decomposition_factors().data
loadings = dp.get_decomposition_loadings().data

# The factorisation is only unique up to a constant factor per component.
# Scale so that each loading has a maximum value of 1.
scaling = loadings.max(axis=(1, 2))  # Maximum in each component
factors *= scaling[:, np.newaxis, np.newaxis]
loadings *= np.reciprocal(scaling)[:, np.newaxis, np.newaxis]
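
The rescaling above does not change the decomposition itself: multiplying a factor by a constant and dividing the corresponding loading by the same constant leaves their product unchanged. A small NumPy check on random stand-in data (the shapes and the helper `reconstruct` are only for illustration) could look like this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in decomposition: 3 components, 4x5 navigation grid, 6x6 patterns
n_components, nav_shape, sig_shape = 3, (4, 5), (6, 6)
factors = rng.random((n_components, *sig_shape))
component_scales = np.array([0.5, 2.0, 10.0])[:, np.newaxis, np.newaxis]
loadings = rng.random((n_components, *nav_shape)) * component_scales

def reconstruct(loadings, factors):
    """Sum over components of loading (navigation axes) times factor (signal axes)."""
    return np.einsum('kyx,kvu->yxvu', loadings, factors)

before = reconstruct(loadings, factors)

# Same scaling as in the cell above, written without in-place modification
scaling = loadings.max(axis=(1, 2))
factors_scaled = factors * scaling[:, np.newaxis, np.newaxis]
loadings_scaled = loadings / scaling[:, np.newaxis, np.newaxis]

after = reconstruct(loadings_scaled, factors_scaled)
```

After the scaling, the loadings of different components are directly comparable, since each one peaks at 1.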

## Clustering

Clustering is often used to group similar data points. To allow clustering, the data points (here, the diffraction patterns, each represented as a vector with $\sim 10^4$ dimensions) have to be reduced to a lower-dimensional space. Here, we use the UMAP algorithm for the dimensionality reduction and HDBSCAN for the clustering. Plenty of other options exist, but the combination of dimensionality reduction followed by clustering is general.

In [9]:
import umap  # conda install -c conda-forge umap-learn, pip install umap-learn, or similar
import hdbscan  # conda install hdbscan, pip install hdbscan, or similar

The clustering depends on quite a few parameters, which we set here. The most important are `n_neighbours` and `cluster_min_size`. Only the UMAP embedding takes time to calculate, so the HDBSCAN parameters are easier to optimise.

In [28]:
# Random seed to get reproducible results. Set to None to get a new number each time
random_seed = 42

# Number of dimensions to reduce to before clustering. 2 allows easy visualisation, higher (~10) might give
# more accurate results.
n_dimensions = 2

# Number of nearest neighbours to check. Higher values (relative to the number of diffraction patterns)
# give a better global clustering (the positions of clusters relative to each other are more representative
# of similarity), while lower values give better positioning within clusters.
# See https://umap-learn.readthedocs.io/en/latest/parameters.html#n-neighbors
n_neighbours = 20

# Minimum distance in the embedding, [0, 1], typically 0.0 for clustering, but close to 1 allows fuzzy clustering.
# See https://umap-learn.readthedocs.io/en/latest/parameters.html#min-dist
min_dist = 0.0

# How conservative the clustering is. Larger numbers assign more points as noise.
# See https://hdbscan.readthedocs.io/en/latest/parameter_selection.html#selecting-min-samples
cluster_min_samples = 1

# Smallest grouping to consider a cluster.
# See https://hdbscan.readthedocs.io/en/latest/parameter_selection.html#selecting-min-cluster-size
cluster_min_size = 20

# Reshape to a two-dimensional matrix, one row per diffraction pattern,
# as required by UMAP
data_flat = dp.data.reshape(-1, dp.axes_manager.signal_size)
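
The reshape keeps the patterns in row-major order, so row `i` of `data_flat` corresponds to navigation position `(y, x)` with `i = y * nav_width + x`. The sketch below checks this on a small stand-in array. The navigation size matches the cropped region (5 by 45 positions), while the 4×4 signal size is only a placeholder for the real pattern size.

```python
import numpy as np

nav_height, nav_width = 45, 5       # the region cropped with dp.inav[95:100, 30:75]
signal_height = signal_width = 4    # small placeholder for the real pattern size

# Stand-in for dp.data, with axes ordered (y, x, signal y, signal x)
data = np.arange(nav_height * nav_width * signal_height * signal_width, dtype=float)
data = data.reshape(nav_height, nav_width, signal_height, signal_width)

# One row per diffraction pattern
data_flat = data.reshape(-1, signal_height * signal_width)

# Convert between a navigation position and a row index, and back
x, y = 2, 34
flat_index = np.ravel_multi_index((y, x), (nav_height, nav_width))
y_back, x_back = np.unravel_index(flat_index, (nav_height, nav_width))
```

The same index conversion is needed later to map a point in the embedding back to its position in the scan.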

The embedding can be saved (below), and later loaded to test different cluster (HDBSCAN) parameters.

In [None]:
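# np.save appends '.npy' if it is missing, so include it here to let np.load find the same file
embedding_filename = r'..\..\Data\Runs\tmp\umap_embedding.npy'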

In [None]:
embedding = np.load(embedding_filename)

If no saved embedding was loaded, run the projection.

In [29]:
# Do the projection to a lower dimensional space (given by 'n_dimensions')
# using UMAP with the parameters specified above.
embedding = umap.UMAP(
    n_neighbors =n_neighbours,
    min_dist    =min_dist,
    n_components=n_dimensions,
    random_state=random_seed,
).fit_transform(data_flat)

Optionally save the embedding, since this is the most expensive step.

In [None]:
np.save(embedding_filename, embedding)

Cluster the low-dimensional data using HDBSCAN and the parameters specified above.

In [30]:
clusterer = hdbscan.HDBSCAN(
    min_samples=cluster_min_samples,
    min_cluster_size=cluster_min_size,
).fit(embedding)

UMAP is working on its own visualisation tools, but for now, we can create them ourselves.

In [37]:
def plot_clustering_results(clusterer, embedding, dp, show_probability=True):
    fig, (ax_scatter, ax_phases, ax_diffraction) = plt.subplots(nrows=1, ncols=3)
    
    ax_scatter.set_title('Projection')
    color_palette = sns.color_palette(n_colors=clusterer.labels_.max() + 1)
    #color_palette = sns.color_palette('Paired', n_colors=clusterer.labels_.max() + 1)
    cluster_colors = [color_palette[l] if l >= 0
                      else (0.3, 0.3, 0.3)
                      for l in clusterer.labels_]
    cluster_member_colors = [sns.desaturate(x, p) for x, p in
                             zip(cluster_colors, clusterer.probabilities_)]
    
    ax_scatter.scatter(*embedding.T, s=30, c=cluster_member_colors,
                       alpha=0.25,
                       picker=True)
    ax_scatter.tick_params(
        axis='both',
        which='both',
        bottom=False,
        top=False,
        left=False,
        right=False,
        labelleft=False,
        labelright=False,
        labelbottom=False)

    ax_phases.set_title('Phases')
    phase_map = np.empty((dp.axes_manager.navigation_size, 3))
    
    nav_width, nav_height = dp.axes_manager.navigation_shape
    
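    noise_color = (0.3, 0.3, 0.3)  # same grey as used for noise in the scatter plot
    for i, (label, probability) in enumerate(zip(clusterer.labels_, clusterer.probabilities_)):
        # Label -1 marks noise; indexing the palette with it would pick the last cluster's colour
        cluster_color = color_palette[label] if label >= 0 else noise_color
        phase_map[i] = sns.desaturate(cluster_color, probability) if show_probability else cluster_color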
    ax_phases.imshow(phase_map.reshape(nav_height, nav_width, 3), picker=True)
    
    def update_diffraction_pattern(x, y):
        ax_diffraction.set_title(
            'Diffraction pattern from {}/{}, {}/{}'.format(
                x, dp.axes_manager.navigation_shape[0],
                y, dp.axes_manager.navigation_shape[1]))
        ax_diffraction.imshow(dp.inav[x, y])
        ax_diffraction.figure.canvas.draw_idle()
        
    update_diffraction_pattern(0, 0)
    
    current_annotation = None
    def annotate(x, y, pos):
        nonlocal current_annotation
        if current_annotation is not None:
            current_annotation.remove()
        current_annotation = ax_scatter.annotate(
            '{}, {}'.format(x, y),
            pos,
            xytext=(pos[0] + 2, pos[1] + 2),
            arrowprops = {'arrowstyle': '->'})
        
    def pick_handler(event):
        if isinstance(event.artist, matplotlib.image.AxesImage):
            x = int(round(event.mouseevent.xdata))
            y = int(round(event.mouseevent.ydata))
            annotate(x, y, embedding[np.ravel_multi_index((y, x), (nav_height, nav_width))])
            update_diffraction_pattern(x, y)
        elif isinstance(event.artist, matplotlib.collections.PathCollection):
            picked_index = event.ind[0]
            # The flat index runs over (y, x), matching the row-major data layout
            y, x = np.unravel_index(picked_index, (nav_height, nav_width))
            annotate(x, y, embedding[picked_index])
            update_diffraction_pattern(x, y)
        
    fig.canvas.mpl_connect('pick_event', pick_handler)

In [38]:
if n_dimensions == 2:
    plottable_embedding = embedding
else:
    # To allow visualisation, create an extra embedding in 2D, but keep
    # using the colours (cluster labels) from the higher-dimensional embedding
    plottable_embedding = umap.UMAP(
        n_neighbors =n_neighbours,
        min_dist    =min_dist,
        n_components=2,
        random_state=random_seed,
    ).fit_transform(data_flat)
    
plot_clustering_results(clusterer, plottable_embedding, dp, show_probability=True)

We can also construct loadings and factors from the cluster labels and probabilities returned by HDBSCAN.

In [None]:
# Allocate space for the results
label_count = clusterer.labels_.max() + 1  # include 0
nav_width, nav_height = dp.axes_manager.navigation_shape
signal_width, signal_height = dp.axes_manager.signal_shape
# The data is stored row-major, i.e. as (y, x), so the height comes first
cluster_factors = np.empty((label_count, signal_height, signal_width))
cluster_loadings = np.empty((label_count, nav_height, nav_width))

for label in range(label_count):
    # Set the loading from all the HDBSCAN probabilities,
    cluster_loadings[label] = clusterer.probabilities_.reshape(nav_height, nav_width)
    # but mask out the results not matching this label
    mask = (clusterer.labels_ == label).reshape(nav_height, nav_width)
    cluster_loadings[label][~mask] = 0.0
    # Calculate factors as a weighted average of cluster members
    # and reshape to the correct shape
    cluster_factors[label] = np.average(
        data_flat,
        weights=cluster_loadings[label].ravel(),
        axis=0).reshape(signal_height, signal_width)
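
The construction can be checked on a toy example where the answer is easy to work out by hand. This is a sketch only: the function name `cluster_decomposition` is hypothetical, the labels and probabilities are made up rather than produced by HDBSCAN, and the shapes are given in array order, (height, width).

```python
import numpy as np

def cluster_decomposition(data_flat, labels, probabilities, nav_shape, signal_shape):
    """Build NMF-like loadings and factors from cluster labels and probabilities.

    Points labelled -1 (noise) get zero weight in every cluster.
    """
    label_count = labels.max() + 1
    loadings = np.zeros((label_count, *nav_shape))
    factors = np.zeros((label_count, *signal_shape))
    for label in range(label_count):
        # Membership probability for members of this cluster, zero elsewhere
        weights = np.where(labels == label, probabilities, 0.0)
        loadings[label] = weights.reshape(nav_shape)
        # Weighted average pattern of the cluster members
        factors[label] = np.average(data_flat, weights=weights, axis=0).reshape(signal_shape)
    return loadings, factors

# Toy data: six 2x2 "patterns" on a 2x3 navigation grid
data_flat = np.array([
    [1, 0, 0, 0], [1, 0, 0, 0], [3, 0, 0, 0],   # cluster 0
    [0, 2, 0, 0], [0, 2, 0, 0],                 # cluster 1
    [9, 9, 9, 9],                               # noise
], dtype=float)
labels = np.array([0, 0, 0, 1, 1, -1])
probabilities = np.array([1.0, 1.0, 0.5, 1.0, 0.8, 0.0])

loadings, factors = cluster_decomposition(
    data_flat, labels, probabilities, nav_shape=(2, 3), signal_shape=(2, 2))
```

The third pattern only counts half in the average for cluster 0 because of its lower membership probability, and the noise point does not affect either factor.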