In [None]:
%load_ext autoreload
%autoreload 2

In [1]:
import matplotlib.pyplot as plt
from diffractem import tools, version
from diffractem.dataset import Dataset
from diffractem.stream_parser import StreamParser, augment_stream
from diffractem import pre_proc_opts
import numpy as np
import pandas as pd
import dask.array as da
# from dask.distributed import Client, LocalCluster
import dask
# import h5py

opts = pre_proc_opts.PreProcOpts('preproc.yaml')
cfver = !{opts.im_exc} -v
print(cfver)

['CrystFEL: 0.9.1+886ae521', 'License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.', 'This is free software: you are free to change and redistribute it.', 'There is NO WARRANTY, to the extent permitted by law.', '', 'Written by Thomas White and others.']


# Indexing and Integration
...using _CrystFEL's_ `indexamajig` tool and several wrappers around it.
What you need to begin:
* a virtual-geometry data file, which you should have created during preprocessing. It contains, first and foremost, all necessary information about the beam center and Bragg peak positions.
* a refined unit-cell file, which you can generate using `peak_processing.ipynb`, and good geometry settings in your `.yaml` config file. If unsure about ellipticity, double check using `peak_processing.ipynb`.

First, we define which shot-list fields should be copied into the output stream file of the indexing run (see `indexamajig --copy-hdf5-field`).
Here, we only use the crucial ones, without which the stream file will be hard to use later on.

In [2]:
stream_fields = ['frame', 'sample', 'region', 'crystal_id', 'run', 
                '_Event', '_file', 'center_x', 'center_y'] 

# filter to only take the ones that are actually present
ds_ctr = Dataset.from_files('virtual.h5', open_stacks=True, chunking=-1)
stream_fields = [f'/%/shots/{f}' for f in  stream_fields if f in ds_ctr.shots.columns]

# generate geometry file for virtual geometry from yaml file parameters.
opts.load()
tools.make_geometry(opts, 'virtual.geom', image_name='zero_image', xsize=1024, ysize=1024, mask=False)

Single-file dataset, disabling parallel I/O.
No feature list in data set ('/%/map/features not found in virtual.h5.'). That's ok if it's a virtual or info file.
Persisting stacks to memory: index, nPeaks, peakTotalIntensity, peakXPosRaw, peakYPosRaw
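
The field filtering above can be tried without loading a dataset. The sketch below uses a plain list of column names as a stand-in for `ds_ctr.shots.columns`; the list `available` is made up for illustration and will differ from your own shot table.

```python
# Hypothetical stand-in for ds_ctr.shots.columns
available = ['sample', 'region', 'run', 'crystal_id', 'center_x', 'center_y']

wanted = ['frame', 'sample', 'region', 'crystal_id', 'run',
          '_Event', '_file', 'center_x', 'center_y']

# Keep only fields that exist, preserving the order of `wanted`,
# and prefix them with the HDF5 path placeholder used in this notebook.
present = [f'/%/shots/{f}' for f in wanted if f in available]
skipped = [f for f in wanted if f not in available]

print(present)
print('skipped:', skipped)
```

Printing the skipped fields is a cheap way to notice early that a field you rely on later (for example `_Event`) never made it into the stream.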


### Direct local execution
...generates a shell script `im_run.sh` containing the CrystFEL call, to directly run on this machine, using a number of processes defined in the `procs` argument.
All parameters for indexing are set in the `preproc.yaml` file.

In [4]:
opts.load() # often reload the opts so they remain updated
tools.call_indexamajig('virtual.lst', 'virtual.geom', script='im_run.sh', 
                       output='virtual.stream',  cell='refined.cell', im_params=opts.indexing_params, 
                       copy_fields=stream_fields, procs=40)

In [5]:
# for the curious cats
!cat im_run.sh

indexamajig -g virtual.geom -i virtual.lst -o virtual.stream -j 40 -p refined.cell --indexing=pinkIndexer --integration=rings-nograd-nocen --int-radius=3,4,6 --peaks=cxi --hdf5-peaks=/entry/data --no-revalidate --max-res=400 --pinkIndexer-considered-peaks-count=4 --pinkIndexer-angle-resolution=4 --pinkIndexer-refinement-type=5 --pinkIndexer-thread-count=1 --pinkIndexer-tolerance=0.1 --pinkIndexer-reflection-radius=0.001 --pinkIndexer-max-resolution-for-indexing=2 --min-peaks=15 --no-refine --no-retry --no-check-peaks --temp-dir=/scratch/diffractem --copy-hdf5-field=/%/shots/sample --copy-hdf5-field=/%/shots/region --copy-hdf5-field=/%/shots/crystal_id --copy-hdf5-field=/%/shots/run --copy-hdf5-field=/%/shots/_Event --copy-hdf5-field=/%/shots/_file --copy-hdf5-field=/%/shots/center_x --copy-hdf5-field=/%/shots/center_y
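
To read the long command line above, it helps to see how a parameter mapping could turn into flags. The function below is a hypothetical illustration, not diffractem's implementation: it assumes that a value of `True` means a bare switch and that any other value is appended as `--key=value`.

```python
import shlex

def params_to_flags(params):
    """Turn a dict into '--key' or '--key=value' strings (illustrative only)."""
    flags = []
    for key, value in params.items():
        if value is True:
            flags.append(f'--{key}')           # bare switch
        elif value is False or value is None:
            continue                           # switched off: emit nothing
        else:
            flags.append(f'--{key}={value}')
    return flags

# A few parameters taken from the command shown above
example = {'indexing': 'pinkIndexer', 'int-radius': '3,4,6',
           'min-peaks': 15, 'no-refine': True, 'no-retry': True}
copy_fields = ['/%/shots/sample', '/%/shots/region']

cmd = ['indexamajig', '-g', 'virtual.geom', '-i', 'virtual.lst', '-o', 'virtual.stream']
cmd += params_to_flags(example)
cmd += [f'--copy-hdf5-field={f}' for f in copy_fields]
print(shlex.join(cmd))
```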

### Version for clusters
...which splits up the patterns into sections of `shots_per_run`, and generates a script file that submits them independently to a SLURM queue manager. Similar to CrystFEL's `turbo-index-slurm`, but a bit more streamlined. All required files for indexing can optionally be packed into a `.tar.gz` file, which can be uploaded to a cluster right away and run there.

Here, `procs` defines the number of parallel processes with which a chunk of `shots_per_run` shots is processed. Additionally, `threads` can be set, which are used by _pinkIndexer_. Raising `threads` instead of `procs` is especially useful to save memory.

It is important that the `exc` argument is set to the path of the `indexamajig` executable on your cluster.

In [6]:
tar, script = tools.call_indexamajig_slurm('virtual.lst', 'virtual.geom', name='lyso_idx', cell='refined.cell',
                                           im_params=opts.indexing_params, procs=4, threads=2, shots_per_run=50,
                                           tar_file='virtual.tar.gz', temp_dir='$TMP_LOCAL', copy_fields=stream_fields,
                                           exc='$HOME/SHARED/EDIFF/software/crystfel9/bin/indexamajig',
                                           local_bin_dir='/opts/crystfel_master/bin')

Wrote self-contained tar file lyso_idx.tar.gz. Upload to your favorite cluster and extract with: tar -xf lyso_idx.tar.gz
Run indexing by calling ./im_run_lyso_idx.sh
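
The partitioning idea can be sketched in a few lines. This is an illustration, not the code behind `call_indexamajig_slurm`: it assumes a list with one pattern per entry and cuts it into consecutive blocks of at most `shots_per_run` entries. The entries themselves are made up.

```python
def partition(entries, shots_per_run):
    """Split a list into consecutive chunks of at most shots_per_run items."""
    if shots_per_run < 1:
        raise ValueError('shots_per_run must be positive')
    return [entries[i:i + shots_per_run]
            for i in range(0, len(entries), shots_per_run)]

# Hypothetical list-file entries (file name plus event identifier)
entries = [f'virtual.h5 entry//{i}' for i in range(120)]
chunks = partition(entries, shots_per_run=50)

print(len(chunks), 'partitions with sizes', [len(c) for c in chunks])
```

Smaller values of `shots_per_run` give more, shorter jobs, which tends to schedule faster on a busy queue at the cost of more per-job overhead.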


#### Template for sending to/receiving from a cluster

In [7]:
# upload immediately to your cluster
# remote = 'rbuecke1@login.gwdg.de:~/SHARED/EDIFF/lyso_redo'
# !ssh {remote.split(":")[0]} 'mkdir -p {remote.split(":")[1]}'
# !scp {tar} {remote}

lyso_idx.tar.gz                               100% 1814KB   1.8MB/s   00:00    


In [19]:
# concat streams on server and transfer back
# name = 'lyso_idx'
# cmd = f'ssh {remote.split(":")[0]} \"cat {remote.split(":")[1]}/partitions/*.stream > {remote.split(":")[1]}/virtual.stream\"'
# !{cmd}
# !scp -r {remote}/virtual.stream .

virtual.stream                                100%  121MB 121.0MB/s   00:01    
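
If the partition streams are copied back individually instead, the `cat` step can be done locally. A minimal sketch, assuming the partial streams sit in a `partitions` folder as in the remote command above; it builds two dummy files in a temporary directory so that it runs anywhere. Like `cat`, it simply appends the files in name order.

```python
import shutil
import tempfile
from pathlib import Path

def concat_streams(partition_dir, out_file):
    """Concatenate all *.stream files in partition_dir (sorted by name) into out_file."""
    parts = sorted(Path(partition_dir).glob('*.stream'))
    with open(out_file, 'wb') as out:
        for part in parts:
            with open(part, 'rb') as src:
                shutil.copyfileobj(src, out)
    return len(parts)

# Demo with dummy content in a temporary directory
workdir = Path(tempfile.mkdtemp())
(workdir / 'partitions').mkdir()
(workdir / 'partitions' / 'part_000.stream').write_text('first\n')
(workdir / 'partitions' / 'part_001.stream').write_text('second\n')

n = concat_streams(workdir / 'partitions', workdir / 'virtual.stream')
print(n, 'files merged')
```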


## Integration
Now we have the file `virtual.stream`, which contains our indexing solution!
We now need to run `indexamajig` a second time, this time on our actual data and using `indexing=file`.
The data we use can either be the set we generated `virtual.stream` from (that is, `hits_agg.lst`), or another aggregation (or even single shots), if we prepared it in `preprocessing.ipynb` and merged the image info into it.
We do the latter, and use the files with aggregation from shots 0 to 2 (5 ms effective exposure time), listed in `hits_0to2.lst`.
With `indexing=file`, instead of running a fresh indexing, `indexamajig` takes a _solution file_ (`.sol`), which contains per line:
* The filename and CrystFEL event identifier of an indexed crystal (2 parameters)
* The reciprocal lattice vectors in the laboratory frame (9 parameters)
* The shift of the detector for that pattern (2 parameters). This is particularly important, as here we can inject the variable beam center of our datasets, on top of the (much smaller) residual shift that a prediction refinement after indexing might have found.

Of course, this file is generated automatically.
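
The line layout described above can be made concrete with a small parser. This is a sketch under assumptions: fields are taken to be whitespace-separated and in the order listed (file, event, nine lattice components, two shifts). Check a `.sol` file written by your own setup before relying on it. The example values are copied from the first row of the solution table shown further down in this notebook.

```python
def parse_sol_line(line):
    """Split one solution line into file, event, a*, b*, c* and detector shift."""
    fields = line.split()
    if len(fields) < 13:
        raise ValueError(f'expected at least 13 fields, got {len(fields)}')
    values = [float(v) for v in fields[2:13]]
    return {
        'file': fields[0],
        'event': fields[1],
        'astar': tuple(values[0:3]),
        'bstar': tuple(values[3:6]),
        'cstar': tuple(values[6:9]),
        'shift': tuple(values[9:11]),
    }

line = ('proc_data/LysoS1_001_00000_0to2_hit.h5 entry//2 '
        '-0.123068 -0.030390 0.012760 -0.024792 0.116623 0.043024 '
        '-0.047231 0.080319 -0.247748 2.493316 -1.534916')
sol = parse_sol_line(line)
print(sol['event'], sol['shift'])
```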

But first we have to make a geometry file, using our optimized geometry parameters (including the ellipticity refinement from `peak_processing.ipynb`).
All required parameters are in `preproc.yaml`.

In [8]:
# make the final geometry
opts = pre_proc_opts.PreProcOpts('preproc.yaml')
geo = tools.make_geometry(opts,'refined.geom', write_mask=True)

#### Solution file from dataset
This is usually the better (if a bit slower) option compared to the one below.
Here, a Dataset object is loaded from disk.
Now, the stored crystal identification data for each shot in the Dataset (i.e.: `sample`, `region`, `run`, `crystal_id`) are used for matching.
You can hence now integrate even from a totally different set of patterns (e.g. a different aggregation range, or even a set with all non-aggregated data - the crystal ID data will just repeat for each frame).

The solution file should have the same base name as the `.lst` file, as the `file` indexer assumes this.

In [20]:
dsname = 'hits_0to2'
ds_all = Dataset.from_files(dsname + '.lst', open_stacks=False)
ds_all.get_indexing_solution('virtual.stream', sol_file=dsname + '.sol')

Unnamed: 0,file,Event,astar_x,astar_y,astar_z,bstar_x,bstar_y,bstar_z,cstar_x,cstar_y,cstar_z,xshift,yshift
2,proc_data/LysoS1_001_00000_0to2_hit.h5,entry//2,-0.123068,-0.030390,0.012760,-0.024792,0.116623,0.043024,-0.047231,0.080319,-0.247748,2.493316,-1.534916
3,proc_data/LysoS1_001_00000_0to2_hit.h5,entry//3,-0.125631,0.019530,-0.009787,-0.020153,-0.124636,0.008312,-0.012829,0.022624,0.259966,3.664206,-1.495367
4,proc_data/LysoS1_001_00000_0to2_hit.h5,entry//4,0.049782,0.106199,-0.049654,-0.116626,0.050179,-0.007018,0.027380,0.097864,0.241653,2.342227,-1.224425
5,proc_data/LysoS1_001_00000_0to2_hit.h5,entry//5,0.120050,-0.042516,0.003282,-0.040612,-0.101054,0.065185,-0.036354,-0.130464,-0.227013,3.814784,-0.979225
7,proc_data/LysoS1_001_00000_0to2_hit.h5,entry//7,-0.042433,0.115496,-0.029929,0.118534,0.035017,-0.030068,-0.042062,-0.078318,-0.247584,2.700214,-0.748242
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1316,proc_data/LysoS2_046_00000_0to2_hit.h5,entry//29,-0.039397,0.121351,0.001466,0.119805,0.040379,0.000129,-0.001939,0.006021,-0.260452,1.697644,1.652947
1317,proc_data/LysoS2_046_00000_0to2_hit.h5,entry//30,0.052757,0.115157,0.000294,0.109487,-0.050028,0.037872,0.073660,-0.035056,-0.248480,1.359185,1.982266
1318,proc_data/LysoS2_046_00000_0to2_hit.h5,entry//31,0.004924,0.086296,0.092767,0.006753,0.092807,-0.086108,-0.262811,0.017160,-0.002128,1.645963,2.809419
1319,proc_data/LysoS2_046_00000_0to2_hit.h5,entry//32,-0.016611,0.119491,0.039678,-0.125233,-0.014009,-0.007573,-0.010287,-0.082066,0.253090,1.257336,2.179325
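
A quick sanity check on a solution row is to convert the reciprocal vectors back into real-space cell lengths and compare them with `refined.cell`. The sketch below uses the standard relation a = (b* × c*) / V* (and cyclic permutations). It assumes the components are given in nm⁻¹, so the lengths come out in nm; treat the unit as an assumption and verify it for your files.

```python
import math

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def real_space_lengths(astar, bstar, cstar):
    """Lengths of a, b, c computed from the reciprocal lattice vectors."""
    vol = dot(astar, cross(bstar, cstar))        # signed reciprocal cell volume
    a = [x / vol for x in cross(bstar, cstar)]
    b = [x / vol for x in cross(cstar, astar)]
    c = [x / vol for x in cross(astar, bstar)]
    return tuple(math.sqrt(dot(v, v)) for v in (a, b, c))

# First row of the table above
astar = (-0.123068, -0.030390, 0.012760)
bstar = (-0.024792, 0.116623, 0.043024)
cstar = (-0.047231, 0.080319, -0.247748)

a, b, c = real_space_lengths(astar, bstar, cstar)
print(f'a = {a:.2f}, b = {b:.2f}, c = {c:.2f}')
```

If the lengths are far from the expected cell, the solution for that pattern, or the matching between stream and dataset, deserves a closer look.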


#### Solution file directly from stream
Another option to get a `.sol` file is to run the `stream2sol` command-line tool. 
While slightly faster, this only works if you want to integrate from the exact same images as those you used for indexing (i.e., those from `hits_agg.lst` in this example), and if your stream contains fields with mm-calibrated shifts (which might not be the case).

In [None]:
cmd = tools.make_command('stream2sol', input='virtual.stream', output='hits_agg.sol',
                  event_field='hdf5/%/shots/_Event', file_field='hdf5/%/shots/_file',
                  x_shift_field='hdf5/%/shots/shift_x_mm', y_shift_field='hdf5/%/shots/shift_y_mm')
print('Running conversion command:', cmd)
!{cmd};

## Run the integration
Now we're all set to integrate the data set.
The parameters for integration are all set in the `integration_params` structure in `preproc.yaml`.
It can be well worth playing with them, especially `int-radius` and `integration`.
For the latter, we recommend sticking to `rings-nograd-nocen` if your patterns are background-subtracted.
Otherwise, `rings-grad-nocen` might work better.
Abstain from anything with `cen` (as opposed to `nocen`) in it, as it will strongly bias high-resolution peak values.
`overpredict` might help if you plan to merge with partiality correction (though it doesn't help much in our experience), but absolutely don't use it for Monte-Carlo merging.

**Always keep `no-revalidate`, `no-retry`, `no-refine`, `no-check-cell` active.**

After you've run the command (might take a fair bit), you'll have a stream file ready for merging. See `merging.ipynb`.
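
A small guard can catch a forgotten switch before a long integration run. The dict below mirrors the integration flags visible in the command printed at the end of this notebook. How `opts.integration_params` is structured internally is an assumption here (a plain mapping with `True` for bare switches), and `missing_switches` is a hypothetical helper, not part of diffractem.

```python
REQUIRED_SWITCHES = ('no-revalidate', 'no-retry', 'no-refine', 'no-check-cell')

def missing_switches(params, required=REQUIRED_SWITCHES):
    """Return the required switches that are absent or not set to True."""
    return [key for key in required if params.get(key) is not True]

# Hypothetical parameter mapping
integration_params = {
    'indexing': 'file',
    'peaks': 'cxi',
    'hdf5-peaks': '/entry/data',
    'int-radius': '3,4,6',
    'integration': 'rings-nograd-nocen',
    'no-revalidate': True,
    'no-retry': True,
    'no-refine': True,
    'no-check-cell': True,
}

problems = missing_switches(integration_params)
if problems:
    raise ValueError(f'missing integration switches: {problems}')
print('all required switches set')
```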

In [21]:
%mkdir streams
stream_name = f'streams/{dsname}.stream'
list_file = dsname + '.lst'
copy_fields = ['sample', 'region', 'crystal_id', 'run', 
               'adf1', 'adf2', 'lor_hwhm', 'center_x', 'center_y']
tmp_dir = '/scratch/diffractem' # set to '.' if you want to stay here

opts.load()

%cp {list_file.rsplit('.', 1)[0]}.sol {tmp_dir}

copy_fields = [f'/%/shots/{cf}' for cf in copy_fields]
cfcall = tools.call_indexamajig(list_file, 'refined.geom', 
                                output=stream_name, 
                                cell='refined.cell', 
                                im_params=opts.integration_params, 
                                procs=40, exc='/opts/crystfel_hash/bin/indexamajig',
                                copy_fields=copy_fields, temp_dir=tmp_dir)

print('--- RUN THIS ---------------')
print(cfcall)

mkdir: cannot create directory ‘streams’: File exists
--- RUN THIS ---------------
/opts/crystfel_hash/bin/indexamajig -g refined.geom -i hits_0to2.lst -o streams/hits_0to2.stream -j 40 -p refined.cell --indexing=file --peaks=cxi --hdf5-peaks=/entry/data --no-revalidate --int-radius=3,4,6 --integration=rings-nograd-nocen --no-retry --no-refine --no-check-cell --temp-dir=/scratch/diffractem --copy-hdf5-field=/%/shots/sample --copy-hdf5-field=/%/shots/region --copy-hdf5-field=/%/shots/crystal_id --copy-hdf5-field=/%/shots/run --copy-hdf5-field=/%/shots/adf1 --copy-hdf5-field=/%/shots/adf2 --copy-hdf5-field=/%/shots/lor_hwhm --copy-hdf5-field=/%/shots/center_x --copy-hdf5-field=/%/shots/center_y
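
Once the command has finished, a quick count of chunks and crystals tells you whether the solutions were picked up. The sketch below scans for the chunk and crystal start markers that CrystFEL writes into stream files. It runs on a tiny fake stream so that it is self-contained; for real data you would pass an open file handle instead.

```python
import io

CHUNK_MARKER = '----- Begin chunk -----'
CRYSTAL_MARKER = '--- Begin crystal'

def count_chunks_and_crystals(lines):
    """Count processed patterns (chunks) and integrated crystals in a stream."""
    chunks = crystals = 0
    for line in lines:
        if line.startswith(CHUNK_MARKER):
            chunks += 1
        elif line.startswith(CRYSTAL_MARKER):
            crystals += 1
    return chunks, crystals

# Fake two-chunk stream; replace with open('streams/hits_0to2.stream') for real data
fake = io.StringIO(
    '----- Begin chunk -----\n--- Begin crystal\n--- End crystal\n----- End chunk -----\n'
    '----- Begin chunk -----\n----- End chunk -----\n')
chunks, crystals = count_chunks_and_crystals(fake)
print(f'{crystals} crystals in {chunks} chunks')
```

A crystal count far below the number of rows in the `.sol` file suggests that file names or event identifiers did not match between the list file and the solution file.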
