In [8]:
%load_ext autoreload
%autoreload 2
%pwd
import hdf5plugin # required to access LZ4-encoded HDF5 data sets
import matplotlib.pyplot as plt
from diffractem import io, tools, proc_peaks, version, pre_process
from diffractem.dataset import Dataset
from diffractem.stream_parser import StreamParser
import numpy as np
import os
from glob import glob

opts = pre_process.PreProcOpts('preproc.yaml')
cfver = !{opts.im_exc} -v
print('Running on diffractem:', version())
print('Running on', cfver[0])
print('Current path is:', os.getcwd())

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Running on diffractem: v0.2.1-52-g0d34dce
Running on CrystFEL: 0.8.0+f9101682
Current path is: /nas/processing/serialed/GV/publication


# Indexing using PinkIndexer

### Prepare data sets
First, prepare a file list for CrystFEL (a plain text file with one HDF5 file name per line), unless you already generated one during pre-processing.

Optionally, generate a second list that contains only a small test set of patterns, which is used to optimize the indexing parameters. Setting `min_peaks` restricts the test set to patterns with more than that number of detected Bragg peaks. This is advisable, since patterns with only a few peaks rarely index and tell you little about the parameters.
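
As a minimal sketch, the same one-path-per-line list could also be written from Python instead of the shell. `write_file_list` is a hypothetical helper for illustration and not part of diffractem.

```python
from glob import glob
from pathlib import Path


def write_file_list(pattern, list_file):
    """Write all paths matching `pattern` to `list_file`, one per line (hypothetical helper)."""
    files = sorted(glob(pattern))
    if not files:
        # fail early instead of silently writing an empty list
        raise FileNotFoundError(f'No files match {pattern!r}')
    Path(list_file).write_text('\n'.join(files) + '\n')
    return files
```

For the data in this notebook, the call would be `write_file_list('proc_data/GV_*_agg_refined.h5', 'aggregated.lst')`. Unlike the `ls` redirect, this raises an error if no file matches instead of leaving an empty list behind.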

In [2]:
# make an input list file, in case you haven't so far
!ls proc_data/GV_*_agg_refined.h5 > aggregated.lst

In [3]:
# With this, you can prepare a sample data set to optimize indexing
min_peaks = 40
sample_size = 10

ds = Dataset.from_list('aggregated.lst', init_stacks=False, load_tables=False)
ds.stacks_to_shots('nPeaks', 'n_peaks')
shsel = ds.shots.query(f'n_peaks > {min_peaks}').sample(sample_size)
shsel[['file', 'Event']].to_csv(f'sample_{sample_size}_shot.lst',header=False, index=False, sep=' ')

  warn(f'Could not read stack {sn}')
  warn(f'Could not read stack {sn}')
  warn(f'Could not read stack {sn}')


### Cell file generation function
Consult the CrystFEL documentation for the cell file format. The function below is a convenience helper that generates a cell file parametrically, e.g. for parameter scans. It has to be adapted to the lattice of the crystal under study.

In [4]:
def make_cell(unit_cell=102.4, cell_file='GV.cell'):
    """Write a CrystFEL cell file for a cubic I lattice with edge length unit_cell (in Angstrom)."""
    tools.dict2file(cell_file,
                    {'lattice_type': 'cubic', 'centering': 'I',
                     'a': f'{unit_cell} A', 'b': f'{unit_cell} A', 'c': f'{unit_cell} A',
                     'al': '90 deg', 'be': '90 deg', 'ga': '90 deg'},
                    header='CrystFEL unit cell file version 1.0')
    return cell_file
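
For orientation, here is a sketch of the text such a cell file is expected to contain. It assumes the usual `key = value` layout of CrystFEL cell files; the exact whitespace written by `tools.dict2file` may differ. `cell_file_text` is a hypothetical stand-in that only builds the string.

```python
def cell_file_text(a, lattice_type='cubic', centering='I'):
    """Return the text of a CrystFEL cell file for a lattice with a = b = c (hypothetical helper)."""
    lines = ['CrystFEL unit cell file version 1.0', '',
             f'lattice_type = {lattice_type}',
             f'centering = {centering}']
    # all three axes share the same length (in Angstrom) for a cubic lattice
    lines += [f'{axis} = {a} A' for axis in ('a', 'b', 'c')]
    # all angles are 90 degrees for a cubic lattice
    lines += [f'{angle} = 90 deg' for angle in ('al', 'be', 'ga')]
    return '\n'.join(lines) + '\n'


print(cell_file_text(102.4))
```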

### Screen indexing parameters
The following cell automatically generates a command string for indexamajig (part of CrystFEL) and also writes it to a shell script. General indexing settings go into the preproc settings file under `indexing_params`; see `man indexamajig` for their descriptions. They can still be overridden here. As an example of screening indexing parameters, we scan the input unit cell size, using the random subset defined above. After running the cell, execute `im_run.sh` and have a coffee.

Once all is done, inspect the resulting stream files with edview (here: `edview.py screening/cell_scan_XX.stream`) and pick your favorite, and/or load the streams using `StreamParser` and check which ones look most plausible, have the highest indexing rate, etc. For some ideas, see the cells below.

For this particular case, we know that the answer is 102.4 Å.
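
A note on the scan values: the NumPy documentation warns that `np.arange` with a non-integer step can yield an inconsistent number of elements because of floating-point rounding. A minimal sketch of an index-based alternative with a fixed point count follows; `scan_values` is a hypothetical helper, not part of diffractem.

```python
def scan_values(center, step, n_each_side):
    """Return 2 * n_each_side + 1 values centered on `center` (hypothetical helper).

    Each value is computed from its integer index, so rounding errors
    do not accumulate and the number of points is fixed.
    """
    return [round(center + k * step, 6)
            for k in range(-n_each_side, n_each_side + 1)]


cells = scan_values(102.4, 0.4, 2)
print(cells)
```

Centering the scan on the expected value also makes it explicit which cell length is being bracketed.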

In [5]:
opts.load()
os.makedirs('screening', exist_ok=True)
%rm screening/*

callstr = ''
for ii, cell in enumerate(np.arange(101.6, 103.6, 0.4)):
    
    # here, you can override settings from the YAML file.
    # Adjust thread count/procs to your CPU cores and screening set size.
    opts.indexing_params.update({'pinkIndexer-thread-count': 4}) 
    cell_file = make_cell(cell, f'screening/GV_{ii}.cell')
    cfcall = tools.call_indexamajig('sample_10_shot.lst', 'GV.geom', 
                                    f'screening/cell_scan_{ii}.stream', 
                                     cell=cell_file, im_params=opts.indexing_params, 
                                     procs=10, exc=opts.im_exc)
    
    callstr += cfcall + ' \n' # Add a '&' before \n if you want to run them in parallel.

with open('im_run.sh', 'w') as fh:  # write command string to a shell script
    fh.write(f'#!/bin/sh\n{callstr}')
print(callstr) # or copy & paste the cell output, or run directly using "!{callstr}"

rm: cannot remove 'screening/*': No such file or directory
/opts/crystfel_eminus/bin/indexamajig -g GV.geom -i sample_10_shot.lst -o screening/cell_scan_0.stream -j 10 -p screening/GV_0.cell --indexing=pinkIndexer --min-peaks=30 --pinkIndexer-considered-peaks-count=4 --pinkIndexer-angle-resolution=3 --pinkIndexer-refinement-type=5 --pinkIndexer-thread-count=4 --pinkIndexer-tolerance=0.1 --pinkIndexer-reflection-radius=0.003 --pinkIndexer-mrfi=0.4 --fix-profile-radius=3e6 --integration=rings-grad-nocen --push-res=2 --int-radius=3,6,7 --peaks=cxi --hdf5-peaks=/entry/data --no-revalidate --max-res=600 --no-refine --no-retry --no-check-peaks --copy-hdf5-field=/%/shots/crystal_id 
/opts/crystfel_eminus/bin/indexamajig -g GV.geom -i sample_10_shot.lst -o screening/cell_scan_1.stream -j 10 -p screening/GV_1.cell --indexing=pinkIndexer --min-peaks=30 --pinkIndexer-considered-peaks-count=4 --pinkIndexer-angle-resolution=3 --pinkIndexer-refinement-type=5 --pinkIndexer-thread-count=4 --pinkIndexe

In [6]:
# only if you want to run the indexer directly in this notebook (not advised)
!{callstr}

This is what I understood your unit cell to be:
cubic I, right handed.
a      b      c            alpha   beta  gamma
101.60 101.60 101.60 A     90.00  90.00  90.00 deg
List of indexing methods:
   0: pinkIndexer-nolatt-cell   (pinkIndexer using cell parameters as prior information)
Indexing parameters:
                  Check unit cell parameters: on
                        Check peak alignment: off
                   Refine indexing solutions: off
 Multi-lattice indexing ("delete and retry"): off
                              Retry indexing: off
Waiting for the last patterns to be processed...
peak count used for indexing: 60
computed 0%
peak count used for indexing: 148
computed 0%
peak count used for indexing: 154
computed 0%
peak count used for indexing: 132
computed 0%
peak count used for indexing: 152
computed 0%
peak count used for indexing: 43
computed 0%
peak count used for indexing: 105
computed 0%
peak count used for indexing: 162
computed 0%
peak count used for indexing: 1

In [9]:
# read and collate the resulting streams.
import pandas as pd

shots = []
for fn in sorted(glob('screening/*.stream')):
    the_stream = StreamParser(fn)
    # label each shot with its stream (file name without directory and extension)
    the_stream._shots['stream'] = fn.rsplit('/',1)[-1].rsplit('.',1)[0]
    shots.append(the_stream.shots)
shots = pd.concat(shots)

In [10]:
# analyze, e.g. using a pivot table. Rows are the test patterns,
# columns are the stream file and the resulting variables.
# NaN means that indexing did not work for that shot
print(shots.pivot(index='serial', columns='stream', values=['a', 'c']))

# also calculate the mean lattice, to get an idea in which direction to try next
cols = ["a","b","c","al","be","ga","xshift","yshift"]
ovtbl = pd.concat([shots[cols].mean(), shots[cols].std()],
                  axis=1).rename(columns={0:'mean', 1:'std'})
print('----\nLattice/geometry parameters:')
print(ovtbl)

                 a                                                  \
stream cell_scan_0 cell_scan_1 cell_scan_2 cell_scan_3 cell_scan_4   
serial                                                               
0              NaN    10.31306    10.32278    10.29012    10.40311   
1         10.24899    10.24614    10.29207    10.40864    10.45730   
2              NaN         NaN    10.27153    10.34148    10.25313   
3         10.28219    10.28041    10.41902    10.29330    10.38548   
4         10.24890    10.32012    10.35173    10.49047    10.42244   
5         10.29095    10.23797    10.34989    10.37067    10.36725   
6              NaN    10.26615    10.30410    10.35127    10.42647   
7         10.19354    10.27804    10.10549    10.28051    10.35364   
8         10.30241    10.33925    10.32433    10.33660    10.38174   
9              NaN         NaN    10.23579    10.29231    10.26236   

                 c                                                  
stream cell_scan_0 c
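
One way to compare the scan results quantitatively is to count the indexed shots per stream, as in the sketch below. It uses a toy table holding the `a` values of the first three streams from the pivot table above. The values are presumably in nm, since they are about a tenth of the cell length in Å. The `toy` frame is only a stand-in for the real `shots` table.

```python
import pandas as pd

nan = float('nan')
# 'a' values of the first three streams, copied from the pivot table above
a_values = {
    'cell_scan_0': [nan, 10.24899, nan, 10.28219, 10.24890,
                    10.29095, nan, 10.19354, 10.30241, nan],
    'cell_scan_1': [10.31306, 10.24614, nan, 10.28041, 10.32012,
                    10.23797, 10.26615, 10.27804, 10.33925, nan],
    'cell_scan_2': [10.32278, 10.29207, 10.27153, 10.41902, 10.35173,
                    10.34989, 10.30410, 10.10549, 10.32433, 10.23579],
}
# long format with one row per (serial, stream), like the collated shot table
toy = (pd.DataFrame(a_values).rename_axis('serial').reset_index()
       .melt(id_vars='serial', var_name='stream', value_name='a'))

# 'count' ignores NaN, so it equals the number of indexed shots per stream
summary = toy.groupby('stream')['a'].agg(n_indexed='count', mean_a='mean', std_a='std')
print(summary.sort_values('n_indexed', ascending=False))
```

With the real data, the same `groupby` can be applied to `shots` directly. A stream that indexes more of the test patterns with a small spread in the refined lattice is usually the better starting point.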

### Run indexing
The screening above should have shown that 102.4 Å is a good cell length. Now the actual indexing is run: there is no loop any more, and the list containing all shots is used. This can take a very long time, so it is best run overnight.

Two hints to monitor the progress of indexing:
- `edview.py stream/aggregated.stream --internal`, and hit reload/last occasionally. The `--internal` option is important in this case to avoid file locking issues.
- `less stream/aggregated.stream` to navigate around the file
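
A further option, sketched here under the assumption that the stream file uses the usual CrystFEL markers (`----- Begin chunk -----` for each processed pattern, `--- Begin crystal` for each indexed lattice), is to count those markers while indexing is running. `stream_progress` is a hypothetical helper, not part of diffractem.

```python
def stream_progress(stream_file):
    """Count processed chunks and indexed crystals in a stream file (hypothetical helper)."""
    n_chunks = n_crystals = 0
    # errors='replace' tolerates a partially written last line while indexing is running
    with open(stream_file, errors='replace') as fh:
        for line in fh:
            if line.startswith('----- Begin chunk'):
                n_chunks += 1
            elif line.startswith('--- Begin crystal'):
                n_crystals += 1
    return n_chunks, n_crystals
```

Calling `stream_progress('stream/aggregated.stream')` from a second notebook or terminal session gives a rough idea of how far the run has progressed. The file is only read, so this should not interfere with the running indexer.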

In [18]:
opts.load()

# here, you can override settings from the YAML file.
opts.indexing_params.update({'min-peaks': 30, 'integration': 'rings-grad-nocen'}) 

os.makedirs('stream', exist_ok=True)
cell_file = make_cell(102.4, 'GV.cell')
callstr = tools.call_indexamajig('aggregated.lst', 'GV.geom', 'stream/aggregated.stream', 
                                 cell=cell_file, im_params=opts.indexing_params, 
                                 procs=96, exc=opts.im_exc)

with open('im_run.sh', 'w') as fh:  # write command string to a shell script
    fh.write(f'#!/bin/sh\n{callstr}')
print(callstr) # or copy & paste

/opts/crystfel_eminus/bin/indexamajig -g GV.geom -i aggregated.lst -o stream/aggregated.stream -j 96 -p GV.cell --indexing=pinkIndexer --min-peaks=30 --pinkIndexer-considered-peaks-count=4 --pinkIndexer-angle-resolution=3 --pinkIndexer-refinement-type=5 --pinkIndexer-thread-count=1 --pinkIndexer-tolerance=0.1 --pinkIndexer-reflection-radius=0.003 --pinkIndexer-mrfi=0.4 --fix-profile-radius=3e6 --integration=rings-grad-nocen --push-res=2 --int-radius=3,6,7 --peaks=cxi --hdf5-peaks=/entry/data --no-revalidate --max-res=600 --no-refine --no-retry --no-check-peaks --copy-hdf5-field=/%/shots/crystal_id


### Indexing done!
Congratulations! Now let's briefly look at the results:
- run `edview.py stream/aggregated.stream` for visual inspection, which is the most important check
- analyze the stream file; see the cell below for some ideas. Ideally, the indexed fraction is close to the hit fraction, and the standard deviations of the lattice parameters are small and close to what you expect.

In [14]:
stream = StreamParser('stream/aggregated.stream')

In [15]:
nshots = stream.shots.shape[0]
print(f'Hit fraction is {stream.shots.hit.sum()/nshots*100:.1f}%')
print(f'Indexed fraction (total) is {(stream.shots.indexed_by != "none").sum()/nshots*100:.1f}%')
cols = ["a","b","c","al","be","ga","xshift","yshift"]
ovtbl = pd.concat([stream.shots[cols].mean(), stream.shots[cols].std()],
                  axis=1).rename(columns={0:'mean', 1:'std'})
print('Lattice/geometry parameters:')
print(ovtbl)

Hit fraction is 87.1%
Indexed fraction (total) is 79.1%
Lattice/geometry parameters:
             mean       std
a       10.311454  0.065553
b       10.282641  0.052176
c       10.232567  0.076982
al      89.948959  0.551735
be      89.588325  0.655149
ga      90.159840  0.486913
xshift   0.002665  0.022375
yshift   0.000551  0.021914
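
To compare these numbers with the input cell, note that the lattice lengths here are presumably in nm (about a tenth of the 102.4 Å given in the cell file). A small sketch of the conversion, using the mean values printed above:

```python
NM_TO_ANGSTROM = 10.0
expected_A = 102.4  # input cell length in Angstrom

# mean lattice lengths copied from the table above (assumed to be in nm)
mean_nm = {'a': 10.311454, 'b': 10.282641, 'c': 10.232567}

# relative deviation of each refined axis from the input cell, in percent
deviation_pct = {axis: (value * NM_TO_ANGSTROM - expected_A) / expected_A * 100
                 for axis, value in mean_nm.items()}

for axis, dev in deviation_pct.items():
    print(f'{axis}: {mean_nm[axis] * NM_TO_ANGSTROM:.2f} A ({dev:+.2f} %)')
```

All three mean axes stay within 1 % of the input cell, which is comparable to the standard deviations listed in the table.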


### (Re-)Integration
You now have a good indexing solution, and you can use the resulting stream file for merging. However, you might be able to do better by integrating from the more sophisticated pre-processed images, which should be available by now. For example, you can try different effective exposure times in the cumulative data sets, and see which trade-off between signal and radiation damage gives the best results.

Unfortunately, as of November 2019, the only way to achieve this with CrystFEL is to re-index each data set as shown here. The indexing result will always be the same, as it is derived from the peaks stored in the HDF5 files, but it still takes a long time to compute. If this is a problem for your data, please contact `robert.buecker@mpsd.mpg.de` for more information and hacky workarounds. For this example data set, we fortunately already know that cumulating to frame 4 gives the best result, so let's do that.

While it's running, you could e.g. start playing with merging of the previous result (`aggregated.stream`), or even get a first structure from the merging result.

In [16]:
# make a list file from all patterns with frame==4
ds = Dataset.from_list('proc_data/GV_S11_*_all_nobg_cumfrom0.h5',
                      init_stacks=False, load_tables=False)
ds.shots.query('frame==4')[['file', 'Event']].to_csv('frame0to4.lst', 
                                                     header=False, index=False, sep=' ')

In [19]:
opts.load()
opts.im_exc = '/opts/crystfel_eminus/bin/indexamajig'

cell_file = make_cell(102.4, 'GV.cell')
callstr = tools.call_indexamajig('frame0to4.lst', 'GV.geom', 'stream/frame0to4.stream', 
                                 cell=cell_file, im_params=opts.integration_params, 
                                 procs=96, exc=opts.im_exc)

with open('im_run.sh', 'w') as fh:  # write command string to a shell script
    fh.write(f'#!/bin/sh\n{callstr}')
print(callstr) # or copy & paste

/opts/crystfel_eminus/bin/indexamajig -g GV.geom -i frame0to4.lst -o stream/frame0to4.stream -j 96 -p GV.cell --indexing=pinkIndexer --min-peaks=30 --pinkIndexer-considered-peaks-count=4 --pinkIndexer-angle-resolution=3 --pinkIndexer-refinement-type=5 --pinkIndexer-thread-count=1 --pinkIndexer-tolerance=0.1 --pinkIndexer-reflection-radius=0.003 --pinkIndexer-mrfi=0.4 --fix-profile-radius=3e6 --integration=rings-nograd-nocen --push-res=2 --int-radius=3,6,7 --peaks=cxi --hdf5-peaks=/entry/data --no-revalidate --max-res=600 --no-refine --no-retry --no-check-peaks --copy-hdf5-field=/%/shots/crystal_id
