## Preprocess widefield calcium imaging data using Spark
This notebook demonstrates how to read raw binary data files stored on UZH Swift object storage into a Spark RDD, convert them into NumPy arrays, and preprocess them to generate dF/F (DFF) arrays. Both the raw movies and the DFF arrays are then saved as HDF5 files on the Swift object storage.

### Imports

In [None]:
# Import Python modules
# (__future__ imports must be the first statement in the cell)
from __future__ import print_function
import os, sys
import numpy as np
from matplotlib import pylab as plt
import getpass
import h5py
import tempfile
import shutil

%matplotlib inline

# the notebook backend: 'local' or 'openstack'
nbBackend = 'openstack'

# add folder 'utils' to the Python path
# this folder contains custom written code that is required for data import and analysis
utils_dir = os.path.join(os.getcwd(), 'utils')
sys.path.append(utils_dir)

In [None]:
# Import custom-written modules
import SwiftStorageUtils
import WidefieldDataUtils as wf
import PickleUtils as pick
import CalciumAnalysisUtils as calciumTools

### File paths and directories

In [None]:
# start of name for matching files
filename_start = '20152310_' # all files with names starting like this will be processed

# swift file system
swift_container = 'ariel' # specify name of container in Swift (do not use _ etc. in container names!)
swift_provider = 'SparkTest' # in general, this should not change

# derive the Swift base URI
swift_basename = "swift://" + swift_container + "." + swift_provider + "/"

In [None]:
# OpenStack credentials for accessing Swift storage
os_username = 'hluetc'
os_tenant_name = 'helmchen.hifo.uzh'
os_auth_url = 'https://cloud.s3it.uzh.ch:5000/v2.0'
# provide OS password
os_password = getpass.getpass()

In [None]:
# put all these parameters in a dictionary, so that we can pass them conveniently to functions
file_params = dict()
file_params['filename_start'] = filename_start
file_params['swift_container'] = swift_container
file_params['swift_provider'] = swift_provider
file_params['swift_basename'] = swift_basename
file_params['os_username'] = os_username
file_params['os_tenant_name'] = os_tenant_name
file_params['os_auth_url'] = os_auth_url
file_params['os_password'] = os_password

### Experiment parameters

In [None]:
# dimensions and number of frames of input data
dims = (512,512)
timepoints = 200

# image dimensions for analysis (aspect ratio MUST be preserved)
dims_analysis = (256,256) # use None or dims to skip resizing

if not dims_analysis:
    dims_analysis = dims

# time vector and trial times (in s)
sample_rate = 20.0 # Hz
t = (np.arange(timepoints) / sample_rate) - 3.0 # frame 0 is at t = -3 s

t_stim = -1.9 # stimulus cue (auditory)
t_textIn = 0 # texture in (i.e. stimulus onset)
t_textOut = 2 # texture starting to move out (stimulus offset)
t_response = 4.9 # response cue for licking (auditory)
t_base = -2 # baseline end (for F0 calculation)

### Analysis parameters

In [None]:
bg_smooth = 30 # SD of Gaussian smoothing kernel for background estimation (in pixels)

# Segmentation threshold; larger value = bigger mask,
# smaller value = smaller mask (i.e. more pixels ignored); suggested = 0.0002
seg_cutoff = 0.0002

# Frames for F0 calculation
# Option 1: all frames before the end of the baseline period
f0_frames = t < t_base

# Option 2 (currently active, overrides option 1): a fixed range of frames
f0_frames[:] = False
f0_frames[9:12] = True # frames 9, 10 and 11
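
To see which frames the two F0 definitions above select, the following sketch rebuilds the time vector with the parameter values from this notebook and lists the frame indices of each mask. It is a standalone illustration and not part of the pipeline; the names `f0_baseline` and `f0_fixed` are introduced here only for the comparison.

```python
import numpy as np

# parameter values copied from the cells above
timepoints = 200
sample_rate = 20.0  # Hz
t_base = -2  # baseline end in s

# trial time in s; frame 0 is 3 s before texture-in (t = 0)
t = np.arange(timepoints) / sample_rate - 3.0

# definition 1: every frame before the end of the baseline
f0_baseline = t < t_base
baseline_idx = np.flatnonzero(f0_baseline)

# definition 2: a fixed range of frames (the one the notebook ends up using)
f0_fixed = np.zeros(timepoints, dtype=bool)
f0_fixed[9:12] = True
fixed_idx = np.flatnonzero(f0_fixed)

print('baseline frames: %d to %d' % (baseline_idx[0], baseline_idx[-1]))
print('fixed frames: %s' % fixed_idx.tolist())
```

With these values, the first definition selects frames 0 to 19, whereas the active definition uses only frames 9 to 11.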

### Start SparkContext

In [None]:
from setupSpark import initSpark
# Initialize Spark
# specify the number of cores and the memory of the workers
# each worker VM has 8 cores and 32 GB of memory
# the status of the cluster (i.e. how many cores are available) can be checked in the Spark UI:
# http://SparkMasterIP:8080/

spark_instances = 2 # the number of workers to be used
executor_cores = 8 # the number of cores to be used on each worker
executor_memory = '28G' # the amount of memory to be used on each worker
max_cores = 16 # the max. number of cores Spark is allowed to use overall

# returns the SparkContext object 'sc' which tells Spark how to access the cluster
sc = initSpark(nbBackend, spark_instances=spark_instances, executor_cores=executor_cores,
               max_cores=max_cores, executor_memory=executor_memory)

# from pyspark import SparkFiles, StorageLevel

In [None]:
# provide OpenStack credentials to the Spark Hadoop configuration
# (the service name in the property keys must match swift_provider)
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('fs.swift.service.%s.username' % swift_provider, os_username)
hadoop_conf.set('fs.swift.service.%s.tenant' % swift_provider, os_tenant_name)
hadoop_conf.set('fs.swift.service.%s.password' % swift_provider, os_password)

In [None]:
# add Python files in 'utils' folder to the SparkContext 
# this is required so that all files are available on all the cluster workers
for filename in os.listdir(utils_dir):
    if filename.endswith('.py'):
        sc.addPyFile(os.path.join(utils_dir, filename))

### Load files into RDD

In [None]:
# create Spark RDD with all objects in the Swift container
file_rdd = sc.binaryFiles(file_params['swift_basename'])
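# keep only the files whose URI contains filename_start
# (kv is a (key, value) tuple; tuple-unpacking lambdas work in Python 2 only)
file_rdd = file_rdd.filter(lambda kv: file_params['filename_start'] in kv[0])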

The elements in file_rdd are key-value pairs, where the key is the file's full Swift URI and the value is the file's content as a byte string.
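
Because each key is a full URI, the substring test used in the filter above matches `filename_start` anywhere in the URI, not only at the start of the file name. The sketch below compares it with a stricter test in plain Python. The keys, file names and the `.dcimg` extension are hypothetical and serve only as an illustration; whether the stricter test is needed depends on what else is stored in the container.

```python
import os

filename_start = '20152310_'

# hypothetical keys, shaped like the URIs that sc.binaryFiles returns
keys = [
    'swift://ariel.SparkTest/20152310_trial01.dcimg',
    'swift://ariel.SparkTest/backup_20152310_trial01.dcimg',
    'swift://ariel.SparkTest/20152311_trial01.dcimg',
]

def matches_loosely(key):
    """Substring test, as in the filter above."""
    return filename_start in key

def matches_at_start(key):
    """Stricter test: the file name itself must start with filename_start."""
    return os.path.basename(key).startswith(filename_start)

loose = [k for k in keys if matches_loosely(k)]
strict = [k for k in keys if matches_at_start(k)]

print('substring test keeps %d keys' % len(loose))
print('startswith test keeps %d keys' % len(strict))
```

The same predicate could be passed to `file_rdd.filter` in place of the substring test.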

In [None]:
# count() is an action: it triggers evaluation of the RDD and returns the number of matching files
file_rdd.count()

### Convert byte-stream to movie
First, we define a function that specifies how the data should be read from a file. Then we apply an RDD transformation (map) that instructs Spark to pass the value of each element through this function. We also repartition the RDD so that it has as many partitions as there are cores available to Spark (max_cores). Note that RDD transformations are executed only when their results are actually needed ('lazy execution'). In this case, that happens only when we request the first element.
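
The idea of lazy execution can be illustrated without a cluster. The sketch below is a plain-Python analogy that uses a generator expression in place of an RDD. It only shows that defining a transformation does not run it; how much work Spark actually performs for a given action depends on its partitioning and scheduling. The function `convert` and the list `files` are hypothetical stand-ins.

```python
calls = []  # records when the conversion function actually runs

def convert(value):
    """Stand-in for an expensive conversion such as convertDCAMtoMov."""
    calls.append(value)
    return value * 2

files = [('a', 1), ('b', 2), ('c', 3)]  # hypothetical (key, value) pairs

# defining the transformation does no work yet, similar to rdd.map(...)
lazy_movies = ((k, convert(v)) for k, v in files)
calls_after_map = list(calls)

# requesting an element is what triggers the work, similar to rdd.first()
first = next(lazy_movies)
calls_after_first = list(calls)

print('calls after defining the map: %s' % calls_after_map)
print('calls after requesting the first element: %s' % calls_after_first)
```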

In [None]:
def convertDCAMtoMov(byte_stream):
    """
    Convert raw DCAM byte-stream to movie. 
    
    Note that parameters (e.g. image dimensions) are provided as global variables in the notebook.
    """
    byte_stream = byte_stream[234:] # skip the 234-byte offset
    # np.frombuffer replaces the deprecated binary mode of np.fromstring;
    # copy() makes the resulting array writable
    A = np.frombuffer(byte_stream, dtype=np.uint16).copy()
    A = A[:dims[0]*dims[1]*timepoints] # remove surplus data points at the end
    
    # re-arrange data into the correct shape
    mov = np.fliplr(A.reshape([dims[0], dims[1], timepoints], order='F'))
    # hack to remove strange pixels with very high intensity
    mov[mov > 60000] = 0
    
    # resize to analysis dimensions
    mov = wf.resizeMovie(mov, resolution=dims_analysis, interp='bilinear')
    
    return mov
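
How the flat byte stream maps onto a movie array is easier to follow on a toy example. The sketch below builds a synthetic byte stream and decodes it with the same steps as the function above, before resizing. The sizes are assumptions chosen for readability (2 x 3 pixels, 2 frames, a 4-byte offset instead of 234 bytes), and `np.frombuffer` is used for decoding.

```python
import numpy as np

# toy sizes (assumptions for illustration; the notebook uses 512 x 512 x 200
# and a 234-byte offset)
dims = (2, 3)
timepoints = 2
offset = 4

# synthetic byte stream: offset bytes + pixel values 0..11 + one surplus value
n_values = dims[0] * dims[1] * timepoints
pixels = np.arange(n_values, dtype=np.uint16)
surplus = np.array([65535], dtype=np.uint16)
raw = b'\x00' * offset + pixels.tobytes() + surplus.tobytes()

# decode: skip the offset, interpret as uint16, drop the surplus value
A = np.frombuffer(raw[offset:], dtype=np.uint16)[:n_values]

# Fortran order: the first index varies fastest, the frame index slowest
mov = A.reshape([dims[0], dims[1], timepoints], order='F')
frame0 = mov[:, :, 0]

# np.fliplr mirrors every frame along its second axis (axis 1)
flipped0 = np.fliplr(mov)[:, :, 0]

print(frame0)
print(flipped0)
```

With `order='F'`, the first `dims[0]*dims[1]` values of the stream form the first frame, filled column by column, and the frame index is the last axis of the array.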

In [None]:
# convert byte-stream to movie
# mapValues applies the function to the value of each pair and keeps the key unchanged
mov_rdd = file_rdd.mapValues(convertDCAMtoMov).repartition(max_cores) # TODO: preserve keys
# persist caches the RDD for faster access; for large RDDs, this may use a lot of memory
# mov_rdd.persist()

In [None]:
# get first movie (return key-value tuple)
mov1 = mov_rdd.first()

To check if the data has been imported correctly, display some frames as images.

In [None]:
path, file_id = os.path.split(mov1[0])
print('File: %s' % (file_id))
dat = mov1[1]
xy = (dat.shape[0]/1.05, dat.shape[1] - (dat.shape[1]/1.1))
f, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].imshow(dat[:,:,0], cmap='gray', interpolation='none')
axes[0].annotate('Frame %1.0f' % 0, xy=xy, fontsize=14, color='yellow', horizontalalignment='right')
axes[1].imshow(np.nanmean(dat, axis=2), cmap='gray', interpolation='none')
axes[1].annotate('Mean', xy=xy, fontsize=14, color='yellow', horizontalalignment='right')
axes[2].imshow(np.nanmax(dat, axis=2), cmap='gray', interpolation='none')
axes[2].annotate('Max', xy=xy, fontsize=14, color='yellow', horizontalalignment='right')

### Preprocess movie
The preprocessing pipeline currently consists of three steps: estimation and subtraction of the background, segmentation of the area of interest, and normalization (dF/F calculation). As with the conversion, we first define a function and then apply it to the Spark RDD. At this point, the transformations are only registered, not executed.

In [None]:
def preprocMovie(mov, dims=dims, timepoints=timepoints, bg_smooth=bg_smooth, seg_cutoff=seg_cutoff):
    """
    Perform preprocessing steps for a movie.
    
    Note that f0_frames is provided as a global variable in the notebook.
    """
    
    # estimate background signal intensity
    print('Estimating background', end="")
    bg_estimate = wf.estimateBackground(mov[:,:,0], bg_smooth)
    print(' - Done (%1.2f)' % bg_estimate)
    
    # subtract the background (set negative to 0)
    mov = mov - bg_estimate
    mov[mov<0] = 0
    
    # segment out the background (set to np.nan)
    print('Segmenting background', end="")
    mov = wf.segmentBackground(mov, seg_cutoff, plot=False)
    print(' - Done')
    
    # baseline normalization (Dff)
    print('Calculating Dff', end="")
    dff = calciumTools.calculateDff(mov, f0_frames)
    print(' - Done')
    
    return dff
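
The normalization itself happens inside `calciumTools.calculateDff`, whose source is not shown in this notebook. As a rough sketch, assuming the conventional definition dF/F = (F - F0) / F0 with F0 taken as the per-pixel mean over the selected baseline frames, the calculation could look like the following. The function `dff_sketch` is a hypothetical stand-in, not the project's implementation.

```python
import numpy as np

def dff_sketch(mov, f0_frames):
    """
    Conventional dF/F for a movie of shape (rows, cols, frames).

    Assumption: F0 is the per-pixel mean over the frames flagged in f0_frames.
    The notebook's calculateDff may differ in detail.
    """
    mov = np.asarray(mov, dtype=float)
    f0 = np.mean(mov[:, :, f0_frames], axis=2, keepdims=True)
    return (mov - f0) / f0

# toy movie: one row, two pixels, four frames
mov = np.array([[[10, 10, 20, 5],
                 [4, 4, 6, 4]]])
f0_frames = np.array([True, True, False, False])
dff = dff_sketch(mov, f0_frames)

print(dff)
```

Pixels that were set to np.nan during segmentation stay np.nan in this sketch, because every arithmetic operation on NaN yields NaN.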

In [None]:
# apply transformation to the RDD
dff_rdd = mov_rdd.mapValues(preprocMovie)

### Save data as HDF5 files
Now we can save the data back to the Swift storage. This finally kicks off the whole processing pipeline that has been defined so far.

In [None]:
# Set the names for the output folders
output_folder_mov = 'mov_out'
output_folder_dff = 'dff_out'

Check whether the folders already exist. If a folder exists, the function displays its contents and asks for confirmation before deleting it.

In [None]:
from SwiftStorageUtils import deleteExistingFolder
deleteExistingFolder(swift_container, output_folder_mov, file_params)
deleteExistingFolder(swift_container, output_folder_dff, file_params)

In [None]:
def getFileNameFromKey(key):
    """
    Return the file name from the RDD key (i.e. strip the path from the Swift URL)
    """
    path, name = os.path.split(key)
    return name

In [None]:
# Save the image data as HDF5 on Swift storage. 
# This will run all the transformations that have been registered for mov_rdd.
from SwiftStorageUtils import saveAsH5
mov_rdd.foreach(lambda kv: saveAsH5(kv[1], getFileNameFromKey(kv[0]), 'mov', output_folder_mov, file_params))

In [None]:
# Save the dFF data as HDF5 on Swift storage. 
# This will run all the transformations that have been registered for dff_rdd.
dff_rdd.foreach(lambda kv: saveAsH5(kv[1], getFileNameFromKey(kv[0]), 'dff', output_folder_dff, file_params))

### Save RDD as pickle file (do NOT use for now!)

In [None]:
# save mov_rdd as pickle file on Swift
# mov_rdd.saveAsPickleFile('%s%s' % (file_params['swift_basename'], output_folder_mov))

In [None]:
# Sanity check: load RDD and compare with original
# mov_rdd_copy_swift = sc.pickleFile('swift://ariel.SparkTest/mov_out')
# np.array_equal(mov_rdd.first()[1], mov_rdd_copy_swift.first()[1])

In [None]:
# save dff_rdd as pickle file on Swift
# dff_rdd.saveAsPickleFile('%s%s' % (file_params['swift_basename'], output_folder_dff))