In [1]:
# for developers
%load_ext autoreload
%autoreload 2

In [3]:
# import hdf5plugin # required to access LZ4-encoded HDF5 data sets, if not on your global path
import matplotlib.pyplot as plt
from diffractem import version, proc2d, pre_proc_opts, io
from diffractem.dataset import Dataset
from tifffile import imread
import numpy as np
from dask.distributed import Client, LocalCluster, TimeoutError
import os
import pandas as pd

%matplotlib widget

opts = pre_proc_opts.PreProcOpts('preproc.yaml')
# opts.im_exc = 'indexamajig'
cfver = !{opts.im_exc} -v
print('Running on diffractem:', version())
print('Running on', cfver[0])
print('Current path is:', os.getcwd())

pxmask=imread(opts.pxmask)
reference=imread(opts.reference)

Running on diffractem: v0.3.4-39-g3db9f6f
Running on CrystFEL: 0.9.1+f937b91c
Current path is: /nas/localdata/30388/serialed/serialed-examples


# Pre-processing of SerialED data
...from raw diffraction data in _NeXus_ format files to cleaned, sorted, selected, and corrected diffraction data, and accurate information about peak and pattern center positions for further steps. This comprises:
* Aggregation of dose fractionation movies
* Center and peak finding in diffraction patterns
* Hit selection
* Export of peak data files for indexing
* Flat-field, dead-pixel and saturation correction; optionally background subtraction
* Broadcasting of results to single fractionation frames, and different arbitrary cumulations

The central tools are `diffractem.Dataset` objects, which handle file I/O, metadata, subset selection, fast computations, and much more.
The image processing functions in `diffractem.proc2d` are essential as well.
Because Datasets use the _dask_ framework internally, they can be larger than memory, computations are lazy (i.e., executed only when needed) and run in parallel, mostly scaling directly with the number of available cores.

## Initialize _dask.distributed_ cluster
This initializes the computation backend. If no dask.distributed scheduler is running at the specified port (usually 8786), a new one is created. Make sure that the numbers of workers and threads make sense: it's a good idea to match them to your workstation's CPU configuration, and to explicitly set a fast scratch drive as `local_directory`. The output of the cell contains the link to the scheduler's dashboard, where you can follow the computation progress (NB: if you're connecting to Jupyter through an SSH tunnel, you'll need to open an additional tunnel for the dashboard port).
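As a rough guide for choosing `n_workers` and `threads_per_worker`, the following minimal sketch derives a layout from the core count. The helper `suggest_cluster_layout` is hypothetical (it is not part of diffractem or dask), and the default of two threads per worker simply mirrors the cell below.

```python
import os

def suggest_cluster_layout(n_cores=None, threads_per_worker=2):
    """Return (n_workers, threads_per_worker) for a given core count.

    Hypothetical helper: splits the available cores evenly into
    workers, each running `threads_per_worker` threads.
    """
    if n_cores is None:
        # os.cpu_count() may return None on exotic platforms
        n_cores = os.cpu_count() or 1
    n_workers = max(1, n_cores // threads_per_worker)
    return n_workers, threads_per_worker

# A 40-core workstation, as in the cluster summary of this notebook
layout = suggest_cluster_layout(40)
print(layout)
```

The two numbers can then be passed on as `n_workers` and `threads_per_worker` when the cluster is created.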

In [3]:
cluster_port = 8786

try:
    client = Client(address=f'127.0.0.1:{cluster_port}', timeout='2s')
    print('Running cluster scheduler found and connected.')
    client.run(os.chdir, os.getcwd()); # change the cluster to the current directory
except (OSError, TimeoutError):
    print('Seems no cluster scheduler is running. Starting one.')
    cluster = LocalCluster(host=f'127.0.0.1:{cluster_port}', n_workers=20, threads_per_worker=2, 
                       local_directory='/scratch/distributed')
    client = Client(address=f'127.0.0.1:{cluster_port}')

client

Seems no cluster scheduler is running. Starting one.


Client: Scheduler: tcp://127.0.0.1:8786, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 20, Cores: 40, Memory: 540.64 GB


## Load the raw data set
While the files could just be loaded using a wildcard expression in `Dataset.from_files`, it's a good idea to first get an explicit file list using `io.expand_files`, which you can filter in case you find that some files are troublemakers. This list can explicitly be used as input to `Dataset.from_files`.
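Filtering the explicit file list can be as simple as a list comprehension. The sketch below uses only the standard library; the helper `drop_troublemakers` and the file names are made up for illustration, and the resulting list plays the role of `raw_files` in the cell further down.

```python
from fnmatch import fnmatch

def drop_troublemakers(files, exclude_patterns):
    """Keep only files that match none of the shell-style exclude patterns."""
    return [f for f in files
            if not any(fnmatch(f, pattern) for pattern in exclude_patterns)]

# Hypothetical file list, standing in for the list of raw files
all_files = ['raw_data/run_000.nxs',
             'raw_data/run_001.nxs',
             'raw_data/run_002.nxs']

# Suppose run_001 turned out to be corrupted
good_files = drop_troublemakers(all_files, ['*run_001*'])
print(good_files)
```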

An important aspect now is the _chunking_ of the Dataset, which defines the size of the blocks that are loaded into memory and processed separately and in parallel. If you did not use dose fractionation, a value around 20 is a good idea. If you did use dose fractionation, it should be larger, and a multiple of the number of frames per crystal. Read more about chunks in the _dask_ documentation.

In this data set we had 25 frames per movie, so we pick `chunking=50`.
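The rule of thumb above can be written down as a small helper. This is only a sketch: `pick_chunking` is a hypothetical function, and the target of 50 shots per chunk is an assumption chosen to reproduce the value used here, not a diffractem default.

```python
import math

def pick_chunking(frames_per_crystal, target=50, single_frame_default=20):
    """Suggest a chunk size (in shots) for a Dataset.

    Without dose fractionation (one frame per crystal), return a small
    default. Otherwise return the smallest multiple of the movie length
    that is at least `target`, so that no movie is split across chunks.
    """
    if frames_per_crystal <= 1:
        return single_frame_default
    return math.ceil(target / frames_per_crystal) * frames_per_crystal

print(pick_chunking(25))  # 25 frames per movie, as in this data set
print(pick_chunking(1))   # no dose fractionation
```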

In [4]:
opts.load() # re-load parameters from the .yaml file

raw_files = io.expand_files('raw_data/*.nxs', validate=True)
print(f'Found {len(raw_files)} raw files. Have fun pre-processing!')
ds = Dataset.from_files(raw_files, chunking=50, )
ds

Found 34 raw files. Have fun pre-processing!
Persisting stacks to memory: 


diffractem Dataset object spanning 34 NeXus/HDF5 files
-----
55800 shots (55800 selected)
2247 features
1 data stacks: raw_counts
Diffraction data stack: raw_counts
Data files open: True
Data files writable: False

### Merge global metadata into shot list
Here, we merge only the shutter time of each movie frame, as it varied between the regions in this set.

In [5]:
ds.merge_meta('/%/instrument/detector/collection/shutter_time')

## Aggregation/selection
Now, shots that correspond to auxiliary scan points (or that fail other criteria in the shot list) have to be rejected, and, if dose fractionation was used, movies should be summed over some reasonable range of frames before further processing. (In the final steps, we can work on the separate frames again.)

Now is a good time to have a first look at the (aggregated) dataset using `tools.viewing_widget`. Note that no data has been written to disk yet: all operations are done lazily, i.e. computed in real time only when you need them.
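To make the selection-and-summing logic concrete, here is a toy version with plain pandas and NumPy. It is a sketch of the idea only, not of how `Dataset.aggregate` is implemented; the shot table and the 2x2 "images" are invented, and only `crystal_id` is used for grouping to keep the example short.

```python
import numpy as np
import pandas as pd

# Invented shot table: 2 crystals with 6 dose-fractionation frames each
shots = pd.DataFrame({
    'crystal_id': np.repeat([0, 1], 6),
    'frame': np.tile(np.arange(6), 2),
    'shutter_time': 2,
})

# Toy image stack: every pixel of a shot holds the value frame + 1
stack = np.ones((len(shots), 2, 2)) * (shots['frame'].to_numpy() + 1)[:, None, None]

# Step 1: reject shots that do not match the query
selected = shots.query('frame >= 1 and frame <= 4 and shutter_time == 2')

# Step 2: sum the remaining frames of each crystal into one image.
# The index labels equal the stack positions here, because `shots`
# carries the default RangeIndex.
aggregated = {
    crystal: stack[rows.index.to_numpy()].sum(axis=0)
    for crystal, rows in selected.groupby('crystal_id')
}

print(len(selected), 'frames kept,', len(aggregated), 'aggregated shots')
```

In this toy case, frames 1 to 4 contribute the values 2, 3, 4 and 5, so each pixel of an aggregated image equals 14.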

In [6]:
ds_agg = ds.aggregate(query='frame >= 1 and frame <= 4 and shutter_time == 2', 
                      by=['sample', 'region', 'run', 'crystal_id'], how='sum', new_folder='proc_data')
print(f'Have {len(ds_agg.shots)} shots for processing.')

Monotonous aggregation: True 
File/subset remixing: False
Frame aggregation: True
Acq. run aggregation: False
Discarding shot table columns: ['Event', 'frame', 'shot_in_subset']
Persisting stacks to memory: 
Have 2146 shots for processing.


#### Have a first look
Fire up the viewing widget, with some reasonable default parameters.
Log is helpful for unprocessed sets.

In [6]:
%matplotlib widget
ds_agg.view(Imax=4000, log=True)

  ih.set_data(np.log10(img_stack[shot,...].compute(scheduler='single-threaded')))



## Optimize the peak & center finding pipeline
`get_pattern_info` analyzes diffraction patterns (with whatever steps you like; define them in `preproc.yaml`) and returns the results as a pandas DataFrame, plus a dictionary containing the found peak positions in CXI format. For now, it finds the center of mass (COM), fits a Lorentz function to the central region, finds the peaks, and refines the center position using Friedel mate matching.

Here, you can test this pipeline on a small subset of your dataset and check whether peak and center finding work reliably. If not, modify the parameters in your `preproc.yaml` file, especially those for peak finding under `peak_search_params`.
Keep in mind that too many false-positive peaks will confuse the indexer horribly.

For displaying, consider the `log` checkbox and set `Imax` to something high, so you can properly see if the center is well matched.
Then browse a bit through the shots and check the quality of the image annotations.
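For intuition about two of the steps named above (center of mass and Friedel mate matching), here is a minimal NumPy sketch. It is a generic illustration under simple assumptions and not diffractem's implementation; the function names, the tolerance `tol` and the synthetic peaks are all made up.

```python
import numpy as np

def center_of_mass(img):
    """Intensity-weighted mean position (x, y) of an image."""
    y, x = np.indices(img.shape)
    total = img.sum()
    return (x * img).sum() / total, (y * img).sum() / total

def refine_center_friedel(peaks, center0, tol=3.0):
    """Refine a center guess from pairs of Friedel mates.

    Each peak is mirrored through the initial center. If another peak
    lies within `tol` pixels of the mirrored position, the two are
    taken as a Friedel pair, whose midpoint is an estimate of the true
    center. The refined center is the mean of all midpoints.
    """
    peaks = np.asarray(peaks, dtype=float)
    center0 = np.asarray(center0, dtype=float)
    mirrored = 2 * center0 - peaks
    # pairwise distances between mirrored and actual peak positions
    dist = np.linalg.norm(mirrored[:, None, :] - peaks[None, :, :], axis=-1)
    mate = dist.argmin(axis=1)
    matched = dist[np.arange(len(peaks)), mate] < tol
    if not matched.any():
        return center0  # nothing to refine with
    midpoints = (peaks[matched] + peaks[mate[matched]]) / 2
    return midpoints.mean(axis=0)

# Single bright pixel at x=12, y=8
img = np.zeros((21, 21))
img[8, 12] = 1.0
com = center_of_mass(img)

# Synthetic Friedel pairs around a known center
true_center = np.array([100.5, 99.0])
offsets = np.array([[30.0, 10.0], [-20.0, 40.0], [5.0, -50.0]])
peaks = np.vstack([true_center + offsets, true_center - offsets])
refined = refine_center_friedel(peaks, center0=(100.0, 100.0))
print(com, refined)
```

The sketch also shows why false-positive peaks are harmful: a spurious peak that happens to fall near a mirrored position would pull the refined center away from the true one.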

In [38]:
ds_sample = ds_agg.get_random_subset(N=30, seed=1000)
ds_sample.compute_pattern_info('preproc.yaml', client=client, output_file=None)
ds_sample.view(Imax=4000, log=True)

## Run center & peak finding on full data set
Now that you've found the optimal parameters, the same thing is run on the entire data set. We also store a file `image_info.h5`, a valid diffractem-type (NeXus) HDF5 file that contains the metadata and the peak/pattern center positions, but no image data. It is useful as a backup or for making export files for indexers.

In the second cell, the results are merged into our data set object. Optionally, you can instead load the data from an existing `image_info.h5` (or similar) file.

Note also that from this point on you can start using `process_peaks.ipynb` to optimize your geometry and unit cell.

In [9]:
ds_agg.compute_pattern_info(opts='preproc.yaml', client=client)

Starting computation... view detailed dashboard at http://127.0.0.1:8787/status
Wrote analysis results to file image_info.h5 100% Completed | 31.6s


In [32]:
ds_agg.merge_pattern_info(ds_from='image_info.h5')

Single-file dataset, disabling parallel I/O.
No feature list in data set ('/%/map/features not found in image_info.h5.'). That's ok if it's a virtual or info file.
Persisting stacks to memory: nPeaks, peakTotalIntensity, peakXPosRaw, peakYPosRaw
Persisting stacks to memory: nPeaks, peakXPosRaw, peakYPosRaw, peakTotalIntensity


## Hit selection
Define a criterion to select hits via the `selection` query string, then show random sample images to check whether it makes sense. It is often useful to look not only at the total `num_peaks`, but also at `num_lores_peaks` and `frac_lores_peaks`. The resolution limit for what is considered low resolution ("lores") is defined as `lores_limit`, in inverse nanometres.

Afterwards, you can look (again) at your hit-selected data set `ds_hit` to see if it contains hits only.
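A selection string that combines several columns works the same way as the single-column one used below. The following sketch runs on an invented shot table; the threshold values are arbitrary, and computing `frac_lores_peaks` as `num_lores_peaks / num_peaks` is an assumption made for this example.

```python
import pandas as pd

# Invented shot table with peak statistics for four patterns
shots = pd.DataFrame({
    'num_peaks':       [3, 20, 40, 16],
    'num_lores_peaks': [1,  5, 35,  4],
})
# Assumed definition: fraction of peaks below the low-resolution limit
shots['frac_lores_peaks'] = shots['num_lores_peaks'] / shots['num_peaks']

# Require enough peaks, and reject patterns dominated by low-resolution peaks
selection = 'num_peaks > 15 and frac_lores_peaks < 0.5'
shots['hit'] = shots.eval(selection)

n_hits = int(shots['hit'].sum())
print(f'{n_hits} shots out of {len(shots)} selected.')
```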

In [11]:
plt.close('all')
selection = 'num_peaks > 15'

ds_agg.shots['hit'] = ds_agg.shots.eval(selection)
ds_hit = ds_agg.get_selection('hit', file_suffix='_hit.h5')

ds_hit.view(Imax=50)

1322 shots out of 2146 selected.
Persisting stacks to memory: nPeaks, peakTotalIntensity, peakXPosRaw, peakYPosRaw



## Creation of actual processed images
...using the `proc2d.correct_image` function, which applies corrections as defined in the `YAML` file. The stack with corrected images (dask arrays, so they are not actually computed yet!) is added to the dataset.
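To give an idea of what flat-field and dead-pixel correction mean in practice, here is a generic NumPy sketch. It does not reproduce `proc2d.correct_image`; the function name, the convention that non-zero mask values mark bad pixels, and the choice to fill them with a constant are assumptions made for this example. Saturation correction and background subtraction are left out.

```python
import numpy as np

def correct_image_sketch(img, reference, pxmask, fill_value=0.0):
    """Flat-field and dead-pixel correction of a single image.

    img        raw counts
    reference  flat-field image (relative pixel gain)
    pxmask     non-zero where a pixel is bad (assumed convention)
    """
    img = np.asarray(img, dtype=float)
    reference = np.asarray(reference, dtype=float)
    good = np.asarray(pxmask) == 0

    # Normalize the flat field to a mean gain of 1 over the good pixels
    flat = reference / reference[good].mean()
    # Avoid division by zero where the flat field is empty
    safe_flat = np.where(flat > 0, flat, 1.0)

    corrected = img / safe_flat
    corrected[~good] = fill_value  # blank out bad pixels
    return corrected

raw = np.full((2, 2), 10.0)
flatfield = np.array([[2.0, 4.0], [3.0, 0.0]])
mask = np.array([[0, 0], [0, 1]])
corrected = correct_image_sketch(raw, flatfield, mask)
print(corrected)
```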

Then, you can use the viewing widget to assess if the correction matches your expectations. If not, change the options file, and re-iterate until you're happy. (NB: The update of the widget might take slightly longer than before, as the correction is done in real time.)

Finally, run the `compute_and_save` method, which actually computes all corrected patterns and writes them to disk. Passing `exclude_stacks='raw_counts'` prevents the raw data from being written into the new files. This function might take quite a while, so please follow the progress on the dask dashboard. If you follow the standard workflow, this is the first time that diffraction data is actually _written_ to disk (besides the virtual one for indexing).

In [14]:
ds_compute.view(shot=402)


In [15]:
# run the computation. Depending on your computer and data set size, have a coffee or go to bed now.
ds_compute.compute_and_save(diff_stack_label='corrected', list_file='hits_agg.lst', exclude_stacks='raw_counts',
                            client=client, overwrite=True)

Initializing data files...
Storing meta tables...


  warn(f'Non-matching primary diffraction stack labels: '


Storing meta stacks nPeaks, peakTotalIntensity, peakXPosRaw, peakYPosRaw
[########################################] | 100% Completed |  7.8s
Storing diffraction data stack corrected... monitor progress at http://127.0.0.1:8787/status (or forward port if remote)
Initializing data sets for diffraction stack corrected...
Submitting tasks to dask.distributed scheduler...
Starting computation...


## Next steps
Now, you have a set of processed diffraction data files as listed in `hits_agg.lst`, derived from summing (aggregating) over a large range of dose fractionation frames.
Those files also contain the found peaks and beam centers, which you will need for cell refinement, indexing and integration in `proc_peaks.ipynb` and `indexing.ipynb`; independently, this metadata is also written into `image_info.h5`.
In `dose_fractionation.ipynb`, it is shown how to prepare similar data files with different frame aggregation ranges, in order to optimize your data for minimum radiation damage.