# Intro to 4D-STEM data: Load and Basic Preprocessing

Here we:
- load a 4D datacube from a .dm4 file
- examine the datacube and its calibrations
- filter hot pixels
- bin in diffraction space
- crop in real space
- save



### Acknowledgements

This tutorial was created by the py4DSTEM instructor team:
- Ben Savitzky (bhsavitzky@lbl.gov)
- Steve Zeltmann (steven.zeltmann@berkeley.edu)
- Stephanie Ribet (sribet@u.northwestern.edu)
- Alex Rakowski (arakowski@lbl.gov)
- Colin Ophus (clophus@lbl.gov)


Updated 07/10/2023

## Set up the environment

In [1]:
import py4DSTEM

py4DSTEM.__version__

'0.14.2'

---
# Download the tutorial data <a class="anchor" id="part_00"></a>

You can download the tutorial dataset (200 megabytes) here:
* [Small .dm4 vacuum probe file](https://drive.google.com/file/d/1QTcSKzZjHZd1fDimSI_q9_WsAU25NIXe/view?usp=drive_link)

# Load data

In [2]:
# Set the filepath - please change the string below to the location of the data on your system

filepath_data = "/Users/Ben/work/data/py4DSTEM_sampleData/vacuum_probe_20x20.dm4"

In [3]:
# Load a datacube from a .dm4 file

# py4DSTEM uses `import_file` for loading file formats other than its own,
# and uses `read` for native files, i.e. files that py4DSTEM wrote

datacube = py4DSTEM.import_file(
    filepath_data,
)

In [4]:
# Let's look at datacube by just passing it directly to the Python interpreter and seeing
# what it spits out

datacube

DataCube( A 4-dimensional array of shape (20, 20, 512, 512) called 'dm_dataset',
          with dimensions:

          Rx = [0.0,1.0,...] pixels
          Ry = [0.0,1.0,...] pixels
          Qx = [0.0,0.0046968888491392136,...] A^-1
          Qy = [0.0,0.0046968888491392136,...] A^-1
)

In [5]:
# This tells us that our datacube is a 4D array with shape (20,20,512,512), and that:


# 'Real space', or the plane of the sample, has a shape of (20,20), meaning the electron beam
# was rastered over a 20x20 grid, and 

# 'Diffraction space' or reciprocal space, or the plane of the detector, has a shape of (512,512),
# meaning the scattered electron intensities are represented in a 512x512 grid.


# In py4DSTEM we use 'R' for real space and 'Q' for diffraction space.
# Another common convention is to use 'K' for diffraction space.
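
To make the (Rx, Ry, Qx, Qy) axis ordering concrete, here is a minimal sketch using plain NumPy on a small synthetic array. The array `data` and its shape are stand-ins chosen for illustration, not the tutorial dataset. Fixing the first two indices selects one diffraction pattern, and fixing the last two selects one real-space image.

```python
import numpy as np

# Synthetic stand-in for a datacube's underlying array, ordered (Rx, Ry, Qx, Qy).
# The shape is an arbitrary small example, not the tutorial data.
rng = np.random.default_rng(0)
data = rng.poisson(lam=2.0, size=(4, 5, 16, 16))

R_shape = data.shape[:2]   # real space: the scan grid
Q_shape = data.shape[2:]   # diffraction space: the detector grid

# Fix a scan position to get a single diffraction pattern
dp = data[1, 3]            # shape == Q_shape

# Fix a detector pixel to get a single real-space image
im = data[:, :, 8, 8]      # shape == R_shape

# Average over all scan positions to get the mean diffraction pattern
dp_mean = data.mean(axis=(0, 1))

print(R_shape, Q_shape, dp.shape, im.shape, dp_mean.shape)
```

With the tutorial datacube loaded, the same indexing should apply to `datacube.data`, which has shape (20, 20, 512, 512).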

In [6]:
# If we have a large dataset that we know we'll want to bin - or even a dataset that's
# too large to fit in our computer's memory but that *would* fit, if only it were a little
# bit smaller - we can bin the data in diffraction space as it's loaded.  Note that this
# is currently only supported for .dm files.  If you'd like bin-on-load functionality for
# another format, let us know by filing an issue!

# Notice that the diffraction pixel size is now 4 times larger than in the unbinned data above

datacube_binned = py4DSTEM.import_file(
    filepath_data,
    binfactor = 4
)

datacube_binned

100%|█████████████████████████████████████████████| 400/400 [00:00<00:00, 1394.05it/s]


DataCube( A 4-dimensional array of shape (20, 20, 128, 128) called 'dm_dataset',
          with dimensions:

          Rx = [0.0,1.0,...] pixels
          Ry = [0.0,1.0,...] pixels
          Qx = [0.0,0.018787555396556854,...] A^-1
          Qy = [0.0,0.018787555396556854,...] A^-1
)
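
The shape and calibration change above can be reproduced with a short NumPy sketch. This illustrates the idea of binning in diffraction space and is not py4DSTEM's implementation: the helper `bin_diffraction` is hypothetical, and combining each block of pixels by summing is an assumption.

```python
import numpy as np

def bin_diffraction(data, binfactor):
    """Sum binfactor x binfactor blocks of detector pixels (hypothetical helper)."""
    rx, ry, qx, qy = data.shape
    # Crop so that both detector dimensions are divisible by binfactor
    qx_c = qx - qx % binfactor
    qy_c = qy - qy % binfactor
    cropped = data[:, :, :qx_c, :qy_c]
    # Split each detector axis into (blocks, pixels-per-block), then sum within blocks
    blocks = cropped.reshape(
        rx, ry, qx_c // binfactor, binfactor, qy_c // binfactor, binfactor
    )
    return blocks.sum(axis=(3, 5))

# Small synthetic array of ones, so each binned pixel should equal binfactor**2
data = np.ones((2, 2, 16, 16), dtype=np.int64)
binned = bin_diffraction(data, 4)

# Each binned pixel spans `binfactor` original pixels, so the pixel size scales with it
q_pixel_size = 0.0046968888491392136          # A^-1, from the unbinned output above
q_pixel_size_binned = q_pixel_size * 4

print(binned.shape, q_pixel_size_binned)
```

The total extent of diffraction space is unchanged by binning: 512 pixels at about 0.0047 A^-1 and 128 pixels at about 0.0188 A^-1 both span roughly 2.40 A^-1.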

In [7]:
# We're going to work with `datacube` from here, so let's delete `datacube_binned`

del datacube_binned

In [8]:
# The data itself lives here

datacube.data

array([[[[ 0,  0,  0, ...,  0,  0,  2],
         [ 0,  4,  0, ...,  0,  0,  8],
         [ 0,  0,  0, ...,  8,  1,  3],
         ...,
         [ 0, 22,  0, ..., 10, 13,  2],
         [ 2,  6,  6, ...,  1,  0,  5],
         [ 0,  0,  0, ...,  4,  3,  0]],

        [[ 0,  0,  0, ...,  0,  6,  0],
         [ 0,  3,  0, ..., 11,  0,  6],
         [ 0,  5,  0, ...,  3,  0,  7],
         ...,
         [ 0,  1,  0, ...,  0, 14,  0],
         [ 4,  0,  0, ...,  1,  0,  0],
         [ 0,  0,  0, ..., 10,  0,  0]],

        [[ 2,  0,  0, ...,  0,  0,  0],
         [ 0,  5,  5, ...,  6,  0,  0],
         [ 0,  5,  0, ...,  4,  8,  0],
         ...,
         [ 0,  3,  1, ...,  0,  5,  0],
         [ 0, 10,  0, ...,  0,  0,  0],
         [ 0,  8,  1, ...,  5,  2,  0]],

        ...,

        [[ 0,  0,  8, ...,  0,  0,  0],
         [ 5,  6,  0, ...,  8,  0, 15],
         [ 0,  3,  0, ..., 12,  5,  0],
         ...,
         [ 0,  6,  2, ...,  0,  5,  0],
         [ 0,  0,  0, ...,  0,  4,  3],
    

In [9]:
# A few more shape properties -

print(datacube.data.shape)
print(datacube.shape)
print(datacube.Rshape)
print(datacube.Qshape)

(20, 20, 512, 512)
(20, 20, 512, 512)
(20, 20)
(512, 512)


In [10]:
# Vectors which calibrate each dimension of the dataset are included as a
# part of the datacube, using any calibrations retrieved from the file

# dimension vectors -
print('The first dimension:')
print(f'dimension name: {datacube.dim_names[0]}')
print(f'dimension units: {datacube.dim_units[0]}')
print(datacube.dims[0])
print()
print('The third dimension:')
print(f'dimension name: {datacube.dim_names[2]}')
print(f'dimension units: {datacube.dim_units[2]}')
print(datacube.dims[2][:10])  # note the `[:10]` - we're only displaying the first 10 entries

print()

# pixel sizes -
qpix = datacube.calibration.get_Q_pixel_size()
qpixunit = datacube.calibration.get_Q_pixel_units()
print(f"The diffraction space pixels are each {qpix:.4f} {qpixunit}")
rpix = datacube.calibration.get_R_pixel_size()
rpixunit = datacube.calibration.get_R_pixel_units()
print(f"The real space pixels are each {rpix:.4f} {rpixunit}")

The first dimension:
dimension name: Rx
dimension units: pixels
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19.]

The third dimension:
dimension name: Qx
dimension units: A^-1
[0.         0.00469689 0.00939378 0.01409067 0.01878756 0.02348444
 0.02818133 0.03287822 0.03757511 0.042272  ]

The diffraction space pixels are each 0.0047 A^-1
The real space pixels are each 1.0000 pixels


In [11]:
# A complete list of calibrations lives here
# The vectors above are derived from these values

datacube.calibration

Calibration( A Metadata instance called 'calibration', containing the following fields:

             Q_pixel_size:    0.0046968888491392136
             R_pixel_size:    1.0
             Q_pixel_units:   A^-1
             R_pixel_units:   pixels
)
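
The relationship between the pixel sizes and the dimension vectors can be sketched in a few lines of NumPy. This assumes each vector is simply the pixel index multiplied by the pixel size, starting at zero, which matches the printed values above. It is an illustration rather than py4DSTEM's own code.

```python
import numpy as np

# Values copied from the calibration printed above
Q_pixel_size = 0.0046968888491392136   # A^-1
R_pixel_size = 1.0                     # pixels
shape = (20, 20, 512, 512)

# Assumed rule: dimension vector = pixel index * pixel size
Rx = np.arange(shape[0]) * R_pixel_size
Qx = np.arange(shape[2]) * Q_pixel_size

print(Rx)
print(Qx[:10])
```

Changing a pixel size therefore rescales the whole vector. For example, a real space pixel size of 5 would give 0, 5, 10, and so on up to 95 for a 20-position scan axis.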

In [12]:
# Right now the real space pixel size is listed as 1 pixel, telling us that this
# information was not available or was not scraped from the .dm4 file.

# Let's say we know that the real space pixel size between beam positions was 5 nanometers.
# We can set a new value with:

datacube.calibration.set_R_pixel_size(5)
datacube.calibration.set_R_pixel_units('nm')

datacube.calibration

Calibration( A Metadata instance called 'calibration', containing the following fields:

             Q_pixel_size:    0.0046968888491392136
             R_pixel_size:    5
             Q_pixel_units:   A^-1
             R_pixel_units:   nm
)

In [13]:
# The appropriate values are automatically updated in the datacube:

# dimension vectors
print('The first dimension:')
print(f'dimension name: {datacube.dim_names[0]}')
print(f'dimension units: {datacube.dim_units[0]}')
print(datacube.dims[0])
print()

# pixel sizes
rpix = datacube.calibration.get_R_pixel_size()
rpixunit = datacube.calibration.get_R_pixel_units()
print(f"The real space pixels are each {rpix:.4f} {rpixunit}")

The first dimension:
dimension name: Rx
dimension units: nm
[ 0  5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95]

The real space pixels are each 5.0000 nm


# Filter hot pixels

In [14]:
# This is a simple function for finding and removing hot pixels, i.e. too-bright pixels
# whose intensity comes from an artifact like a stray x-ray or detector error rather than
# electron scattering.

# It finds pixels in the mean diffraction image which are `thresh` times brighter than
# any other pixel in their local neighborhood, and replaces those pixels in each
# diffraction pattern by their local median.

datacube.filter_hot_pixels(
    thresh = 8
)

Cleaning pixels: 100%|███████████████████████| 400/400 [00:00<00:00, 6904.89 images/s]


DataCube( A 4-dimensional array of shape (20, 20, 512, 512) called 'dm_dataset',
          with dimensions:

          Rx = [0,5,...] nm
          Ry = [0,5,...] nm
          Qx = [0.0,0.0046968888491392136,...] A^-1
          Qy = [0.0,0.0046968888491392136,...] A^-1
)
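
The idea behind this kind of filter can be sketched with NumPy alone. The code below is a simplified illustration of the description above, not py4DSTEM's implementation: the function names, the 3x3 neighborhood, and the reflected edge padding are all assumptions made for the example.

```python
import numpy as np

def neighborhood_stack(image, include_center):
    """Stack the shifted copies of `image` covering each pixel's 3x3 neighborhood."""
    padded = np.pad(image, 1, mode="reflect")
    h, w = image.shape
    shifts = [
        padded[i:i + h, j:j + w]
        for i in range(3)
        for j in range(3)
        if include_center or (i, j) != (1, 1)
    ]
    return np.stack(shifts)

def filter_hot_pixels_sketch(data, thresh):
    """Flag hot pixels in the mean pattern, then replace them by a local median."""
    dp_mean = data.mean(axis=(0, 1))
    # Brightest of the 8 surrounding pixels in the mean diffraction pattern
    neighbor_max = neighborhood_stack(dp_mean, include_center=False).max(axis=0)
    mask = dp_mean > thresh * neighbor_max

    cleaned = data.copy()
    for rx in range(data.shape[0]):
        for ry in range(data.shape[1]):
            pattern = cleaned[rx, ry]   # a view into `cleaned`
            local_median = np.median(
                neighborhood_stack(pattern, include_center=True), axis=0
            )
            pattern[mask] = local_median[mask]
    return cleaned, mask

# Synthetic data with one artificially hot detector pixel at (6, 6)
rng = np.random.default_rng(0)
data = rng.poisson(lam=5.0, size=(3, 3, 12, 12))
data[:, :, 6, 6] = 1000

cleaned, mask = filter_hot_pixels_sketch(data, thresh=8)
print(mask.sum(), cleaned[:, :, 6, 6].max())
```

Detecting hot pixels in the mean pattern rather than in each pattern separately means a pixel is only flagged if it is consistently bright across the scan, which is what distinguishes a detector defect from a genuinely bright feature at one probe position.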

# Bin in diffraction space

In [15]:
# Bin the data in diffraction space

# Loading an unbinned datacube and then binning later may be preferable to binning on-load
# in some cases, e.g. if preprocessing using unbinned data is desired.

datacube.bin_Q(4)

DataCube( A 4-dimensional array of shape (20, 20, 128, 128) called 'dm_dataset',
          with dimensions:

          Rx = [0,5,...] nm
          Ry = [0,5,...] nm
          Qx = [0.0,0.018787555396556854,...] A^-1
          Qy = [0.0,0.018787555396556854,...] A^-1
)

In [16]:
# Examine the updated calibrations and dimension vectors

# dimension vectors -
print('The first dimension:')
print(f'dimension name: {datacube.dim_names[0]}')
print(f'dimension units: {datacube.dim_units[0]}')
print(datacube.dims[0])
print()
print('The third dimension:')
print(f'dimension name: {datacube.dim_names[2]}')
print(f'dimension units: {datacube.dim_units[2]}')
print(datacube.dims[2][:10])  # note the `[:10]` - we're only displaying the first 10 entries

print()

# pixel sizes -
qpix = datacube.calibration.get_Q_pixel_size()
qpixunit = datacube.calibration.get_Q_pixel_units()
print(f"The diffraction space pixels are each {qpix:.4f} {qpixunit}")
rpix = datacube.calibration.get_R_pixel_size()
rpixunit = datacube.calibration.get_R_pixel_units()
print(f"The real space pixels are each {rpix:.4f} {rpixunit}")

The first dimension:
dimension name: Rx
dimension units: nm
[ 0  5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95]

The third dimension:
dimension name: Qx
dimension units: A^-1
[0.         0.01878756 0.03757511 0.05636267 0.07515022 0.09393778
 0.11272533 0.13151289 0.15030044 0.169088  ]

The diffraction space pixels are each 0.0188 A^-1
The real space pixels are each 5.0000 nm


# Write and read

In [17]:
# Generally it's worth being a little thoughtful about whether or not to re-write
# a datacube into computer storage. Because they tend to be large, and we already
# have access to the datacube from the original microscope file, it's best not to
# write the datacube to a new file unless there is a good reason to.

# That said, sometimes you'll have such a reason!  Common cases where you might want
# to do this are if you've done significant downsampling such that the new datacube
# is much smaller than the original, or if you've done time consuming pre-processing
# which you don't want to repeat for each analysis you do on this dataset.
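
To put a number on "much smaller", the sketch below estimates the in-memory size of the datacube before and after binning by 4 in diffraction space. The 16-bit integer dtype is an assumption for illustration. Check `datacube.data.dtype` for the real value.

```python
import numpy as np

def array_nbytes(shape, dtype):
    """Bytes needed to hold an array of the given shape and dtype."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

assumed_dtype = np.uint16   # assumption, not read from the file

original = array_nbytes((20, 20, 512, 512), assumed_dtype)
binned = array_nbytes((20, 20, 128, 128), assumed_dtype)

print(f"original: {original / 2**20:.1f} MiB")
print(f"binned:   {binned / 2**20:.1f} MiB")
print(f"reduction factor: {original // binned}")
```

Binning by a factor of 4 along both detector axes reduces the element count by a factor of 16, whatever the dtype.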

In [18]:
# Set a filepath

# Here we're going to use the name and location of our original dataset, after
# removing its file extension and appending some info to the end to indicate
# how we've modified it

from os.path import splitext
filepath_save = splitext(filepath_data)[0] + '_preprocessed_filtered_bin4.h5'

print(filepath_save)

/Users/Ben/work/data/py4DSTEM_sampleData/vacuum_probe_20x20_preprocessed_filtered_bin4.h5
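
The same path can be built with `pathlib` if you prefer it to `os.path`. This is an alternative sketch: the helper name `preprocessed_path` is hypothetical, and `PurePosixPath` is used only so the example behaves the same on every operating system. In a real session `pathlib.Path` would be the usual choice.

```python
from pathlib import PurePosixPath

def preprocessed_path(filepath, suffix="_preprocessed_filtered_bin4.h5"):
    """Replace the file extension with a descriptive suffix (hypothetical helper)."""
    p = PurePosixPath(filepath)
    # `stem` is the file name without its final extension
    return str(p.with_name(p.stem + suffix))

example = preprocessed_path(
    "/Users/Ben/work/data/py4DSTEM_sampleData/vacuum_probe_20x20.dm4"
)
print(example)
```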


In [20]:
# Save

py4DSTEM.save(
    filepath_save,
    datacube,
    #mode = 'o'    # 'overwrite' mode: uncomment to replace an existing file at this path
)

In [21]:
# Inspect the resulting HDF5 file

# This lets us look at what data lives in some native py4DSTEM .h5 file without needing
# to open and read the whole thing

py4DSTEM.print_h5_tree(filepath_save)

/
|---dm_dataset_root
    |---dm_dataset




In [22]:
# So what are 'dm_dataset' and 'dm_dataset_root'?

# 'dm_dataset' is the default name that was given to our datacube when
# we loaded it from a .dm file.  You can always modify this by re-assigning the
# datacube.name attribute

# 'dm_dataset_root' is just a holder for 'dm_dataset'.  

datacube.name

'dm_dataset'

In [23]:
# Read the file

d = py4DSTEM.read(
    filepath_save,
)

In [24]:
# Tada!  Our datacube has been loaded from the new .h5 file

d

DataCube( A 4-dimensional array of shape (20, 20, 128, 128) called 'dm_dataset',
          with dimensions:

          Rx = [0,5,...] nm
          Ry = [0,5,...] nm
          Qx = [0.0,0.018787555396556854,...] A^-1
          Qy = [0.0,0.018787555396556854,...] A^-1
)

In [25]:
d.calibration

Calibration( A Metadata instance called 'calibration', containing the following fields:

             Q_pixel_size:     0.018787555396556854
             R_pixel_size:     5
             Q_pixel_units:    A^-1
             R_pixel_units:    nm
             _root_treepath:   
)

In [26]:
d.dims

(array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80,
        85, 90, 95]),
 array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80,
        85, 90, 95]),
 array([0.        , 0.01878756, 0.03757511, 0.05636267, 0.07515022,
        0.09393778, 0.11272533, 0.13151289, 0.15030044, 0.169088  ,
        0.18787555, 0.20666311, 0.22545066, 0.24423822, 0.26302578,
        0.28181333, 0.30060089, 0.31938844, 0.338176  , 0.35696355,
        0.37575111, 0.39453866, 0.41332622, 0.43211377, 0.45090133,
        0.46968888, 0.48847644, 0.507264  , 0.52605155, 0.54483911,
        0.56362666, 0.58241422, 0.60120177, 0.61998933, 0.63877688,
        0.65756444, 0.67635199, 0.69513955, 0.71392711, 0.73271466,
        0.75150222, 0.77028977, 0.78907733, 0.80786488, 0.82665244,
        0.84543999, 0.86422755, 0.8830151 , 0.90180266, 0.92059021,
        0.93937777, 0.95816533, 0.97695288, 0.99574044, 1.01452799,
        1.03331555, 1.0521031 , 1.07089066, 1.08967821, 