# Tutorial about setting up an analysis pipeline and batch processing

Quite often you experiment with various analysis routines and parameters until you arrive at an analysis pipeline. A pipeline procedure is then a script that defines the analysis steps for a single locdata object (or a single group of corresponding locdata objects, as used for instance in 2-color measurements).

The `Pipeline` class can be used to combine the pipeline code, metadata and analysis results in a single pickleable object (meaning it can be serialized by the Python `pickle` module).

This pipeline can then be applied to a number of similar datasets. A batch process is a procedure that runs a pipeline over multiple locdata objects and collects and combines the results.

In [None]:
from pathlib import Path

%matplotlib inline

import matplotlib.pyplot as plt

import locan as lc

In [None]:
lc.show_versions(system=False, dependencies=False, verbose=False)

## Apply a pipeline of different analysis routines

### Load rapidSTORM data file

In [None]:
path = lc.ROOT_DIR / 'tests/test_data/rapidSTORM_dstorm_data.txt'
print(path)
dat = lc.load_rapidSTORM_file(path=path, nrows=1000)
dat.print_summary()

In [None]:
dat.properties

### Set up an analysis procedure

First define the analysis procedure (pipeline) in the form of a computation function. Make sure the first parameter is `self`, which refers to the `Pipeline` object. Add arbitrary keyword arguments after it. If the function ends with `return self`, the `compute()` method can be chained directly onto the instantiation.

In [None]:
def computation(self, locdata, n_localizations_min=4):
    
    # import required modules
    from locan.analysis import LocalizationPrecision
    
    # prologue
    self.file_indicator = locdata.meta.file.path
    self.locdata = locdata
    
    # check requirements: skip datasets with too few localizations
    if len(locdata) <= n_localizations_min:
        return None
    
    # compute localization precision
    self.lp = LocalizationPrecision().compute(self.locdata)
    
    return self

### Run the analysis procedure

Instantiate a `Pipeline` object and run `compute()`:

In [None]:
pipe = lc.Pipeline(computation=computation, locdata=dat, n_localizations_min=4).compute()
pipe.meta
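
Why `return self` matters can be seen in a minimal stand-in. The sketch below is not locan's implementation: `MiniPipeline` and `toy_computation` are hypothetical names, and it assumes only that `compute()` calls the stored function with the pipeline object as first argument and hands back whatever that function returns.

```python
class MiniPipeline:
    """Hypothetical stand-in for illustration, not locan's Pipeline class."""

    def __init__(self, computation, **kwargs):
        self.computation = computation
        self.kwargs = kwargs

    def compute(self):
        # Pass the pipeline object itself as `self`, followed by the keyword arguments.
        return self.computation(self, **self.kwargs)


def toy_computation(self, data, n_min=4):
    # Skip datasets that are too small, mirroring the requirement check above.
    if len(data) <= n_min:
        return None
    self.total = sum(data)
    return self


pipe_ok = MiniPipeline(computation=toy_computation, data=[1, 2, 3, 4, 5]).compute()
pipe_skipped = MiniPipeline(computation=toy_computation, data=[1, 2]).compute()

print(pipe_ok.total)         # attribute set inside the computation function
print(pipe_skipped is None)  # the requirement check returned None
```

Under this assumption, a computation that ends without `return self` would leave the chained expression evaluating to `None` instead of the pipeline object.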

Results are available from the `Pipeline` object as attributes defined in the computation function:

In [None]:
[attr for attr in dir(pipe) if not attr.startswith('__') and not attr.endswith('__')]

In [None]:
pipe.lp.results.head()

In [None]:
pipe.lp.hist();
print(pipe.lp.distribution_statistics.parameter_dict())

You can recover the computation procedure:

In [None]:
pipe.computation_as_string()

or save it as a text protocol.

The `Pipeline` object is pickleable and can therefore be saved and revisited later.
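
A minimal sketch of such a save-and-reload round trip with the standard `pickle` module. To stay self-contained it pickles a plain dictionary as a stand-in; the assumption is that a computed `Pipeline` object can be passed in its place. The helper names and the file name `pipeline.pickle` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Serialize obj to path with the standard pickle module."""
    with open(path, "wb") as file:
        pickle.dump(obj, file, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Load a pickled object; only unpickle files from trusted sources."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Stand-in for a computed Pipeline object (assumption for illustration).
stand_in = {"file_indicator": "rapidSTORM_dstorm_data.txt", "n_localizations_min": 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "pipeline.pickle"
    save_pickle(stand_in, pickle_path)
    restored = load_pickle(pickle_path)

print(restored == stand_in)
```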

## Apply the pipeline to multiple datasets - a batch process

Let's create multiple datasets:

In [None]:
path = lc.ROOT_DIR / 'tests/test_data/rapidSTORM_dstorm_data.txt'
print(path)
dat = lc.load_rapidSTORM_file(path=path)

# use `lower`/`upper` instead of `min`/`max` to avoid shadowing the builtins
locdatas = [lc.select_by_condition(dat, f'{lower}<index<{upper}') for lower, upper in ((0,300), (301,600), (601,900))]
locdatas

Run the analysis pipeline as a batch process:

In [None]:
pipes = [lc.Pipeline(computation=computation, locdata=locdata).compute() for locdata in locdatas]
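
Because `computation` returns `None` for datasets with too few localizations, the resulting list can contain `None` entries if `compute()` passes that return value through, which is assumed here. A small sketch of filtering such entries out before further processing; `drop_skipped` is a hypothetical helper and the list below is stand-in data:

```python
def drop_skipped(results):
    """Return only the entries that are not None, preserving order."""
    return [result for result in results if result is not None]


# Stand-in for a list of computed pipelines where one dataset was skipped.
example_results = ["pipe_0", None, "pipe_2"]

kept = drop_skipped(example_results)
print(f"{len(kept)} of {len(example_results)} datasets passed the requirement check")
```

Applied to the batch above, this would read `pipes = drop_skipped(pipes)`.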

As long as the batch procedure runs in a single computer process, the identifier increases with every instantiation.

In [None]:
[pipe.meta.identifier for pipe in pipes]
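
One common way to obtain such identifiers is a class-level counter, sketched below with the hypothetical class `Counted`. This illustrates the idea and is not taken from locan's source. Because the counter lives in the memory of one Python process, every process keeps its own copy, which is why the ordering only holds within a single process.

```python
class Counted:
    """Hypothetical class with a per-process instance counter."""

    count = 0  # class attribute, shared by all instances within one Python process

    def __init__(self):
        # Increment the shared counter and keep its value as this instance's identifier.
        type(self).count += 1
        self.identifier = str(type(self).count)


identifiers = [Counted().identifier for _ in range(3)]
print(identifiers)  # ['1', '2', '3']
```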

### Visualize the combined results

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1)
for pipe in pipes:
    pipe.lp.plot(ax=ax, window=10)
plt.show()