# *tridesclous* example with locust dataset

This detailed notebook walks through the locust dataset recorded by Christophe Pouzat.

This dataset is our classic.
It has already been analysed with several tools in R, Python or C:
  * https://github.com/christophe-pouzat/PouzatDetorakisEuroScipy2014
  * https://github.com/christophe-pouzat/SortingABigDataSetWithPython
  * http://xtof.perso.math.cnrs.fr/locust.html

So we can compare the results.

The original dataset is here: https://zenodo.org/record/21589

But we will work on a very small subset hosted on GitHub: https://github.com/tridesclous/tridesclous_datasets/tree/master/locust


# Overview

In *tridesclous*, spike sorting is done in several steps:
  * Define the datasource and working path (class DataIO).
  * Construct a *catalogue* (class CatalogueConstructor) on a short chunk of data (for instance 60 s),
    with several sub-steps:
    * signal pre-processing:
      * high-pass filter (optional)
      * removal of common reference (optional)
      * noise estimation (median/MAD) on a small chunk
      * normalisation = robust z-score
    * peak detection
    * extract some waveforms. Extracting them all is unnecessary and impossible.
    * find reasonable limits for the waveforms (n_left/n_right)
    * project these waveforms into a smaller dimension (PCA, ...)
    * find clusters
    * clean with the GUI (class CatalogueWindow)
    * save centroids (median + MAD + first and second derivatives)
  * Apply the *Peeler* (class Peeler) to the long-term signals, with several sub-steps:
     * same signal pre-processing as before
     * find peaks
     * find the best cluster in the catalogue for each peak
     * find the intersample jitter
     * remove the oversampled waveforms from the signals until there are no peaks left in the signals
     * check with the GUI (class PeelerWindow)



In [1]:
%matplotlib inline

import time
import numpy as np
import matplotlib.pyplot as plt
import tridesclous as tdc

from tridesclous import DataIO, CatalogueConstructor, Peeler


# Download a small dataset

*tridesclous* provides some datasets that can be downloaded with **download_dataset**.

Note that this dataset contains 2 trials in 2 different files (the original contains more!).

Each file is considered a *segment*. *tridesclous* deals with this automatically.

In [2]:
#download dataset
localdir, filenames, params = tdc.download_dataset(name='locust')
print(filenames)
print(params)

['/home/samuel/Documents/projet/tridesclous/example/locust/locust_trial_01.raw', '/home/samuel/Documents/projet/tridesclous/example/locust/locust_trial_02.raw']
{'dtype': 'int16', 'sample_rate': 15000.0, 'total_channel': 4}


# DataIO = define datasource and working dir


These 2 files are in **RawData** format, which means a binary format with interleaved channels.

Our dataset contains 2 segments of 28.8 seconds each, with 4 channels. The sample rate is 15 kHz.

Note that there is only one channel_group here (0).

In [3]:
#create a DataIO
import os, shutil
dirname = 'tridesclous_locust'
if os.path.exists(dirname):
    # remove it if it already exists
    shutil.rmtree(dirname)    
dataio = DataIO(dirname=dirname)

# feed DataIO
dataio.set_data_source(type='RawData', filenames=filenames, **params)
print(dataio)

#no need to setup the prb with dataio.set_probe_file() or dataio.download_probe()
#because it is a tetrode
    

DataIO <id: 139632851524968> 
  workdir: tridesclous_locust
  sample_rate: 15000.0
  total_channel: 4
  channel_groups: 0 [ch0 ch1 ch2 ch3]
  nb_segment: 2
  length: 431548 431548
  durations: 28.8 28.8 s.
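
As an illustration of what "binary format with interleaved channels" means, here is a minimal NumPy sketch. It does not use *tridesclous* itself: it writes a tiny made-up recording to a temporary file (the file name `fake_trial.raw` and its content are hypothetical) and reads it back using the `dtype`, `sample_rate` and `total_channel` values printed above. It also checks the duration arithmetic for the real segment length.

```python
import os
import tempfile
import numpy as np

# Values taken from the params printed by download_dataset above.
sample_rate = 15000.0
total_channel = 4
dtype = 'int16'

# Hypothetical toy recording: 6 samples x 4 channels.
fake = np.arange(6 * total_channel, dtype=dtype).reshape(6, total_channel)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'fake_trial.raw')
    # C order on disk: ch0..ch3 of sample 0, then ch0..ch3 of sample 1, ...
    fake.tofile(path)

    # Read the flat stream back and restore the (n_samples, n_channels) shape.
    flat = np.fromfile(path, dtype=dtype)
    sigs = flat.reshape(-1, total_channel)

print(sigs.shape)

# Duration of a real segment: 431548 samples at 15 kHz.
real_duration = 431548 / sample_rate
print(round(real_duration, 1))
```

Each row of `sigs` is one time sample and each column one channel, which is the layout assumed in the other sketches of this notebook.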


# CatalogueConstructor

In [4]:
catalogueconstructor = CatalogueConstructor(dataio=dataio)
print(catalogueconstructor)

CatalogueConstructor
  chan_grp 0 - ch0 ch1 ch2 ch3
  Signal pre-processing not done yet


## Set some parameters for the pre-processing step.

For a complete description of each parameter, see the main documentation.

In [5]:
catalogueconstructor.set_preprocessor_params(chunksize=1024,
            common_ref_removal=False,
            highpass_freq=300.,
            lowpass_freq=5000.,                                             
            lostfront_chunksize=64,
            peak_sign='-',
            relative_threshold=6.5,
            peak_span=0.0001,
            )

## Estimate the median and MAD of the noise on a small chunk of filtered signals.
This computes the median and MAD of each channel.

In [6]:
catalogueconstructor.estimate_signals_noise(seg_num=0, duration=15.)
print(catalogueconstructor.signals_medians)
print(catalogueconstructor.signals_mads)

[1.2877347 0.8443531 1.6870663 0.5088713]
[51.053234 46.69039  57.44741  44.837955]
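
To make the median/MAD estimate and the robust z-score more concrete, here is a small NumPy sketch on synthetic data. It is not the *tridesclous* implementation: the signals are random, and the 1.4826 factor (which makes the MAD comparable to a standard deviation for Gaussian noise) is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical filtered signals: 15000 samples x 4 channels,
# each channel with its own offset and noise amplitude.
true_scales = np.array([50., 45., 55., 40.])
sigs = rng.normal(loc=[1.0, 0.5, 2.0, 0.0], scale=true_scales, size=(15000, 4))

# Per-channel median and MAD (median absolute deviation).
medians = np.median(sigs, axis=0)
# Assumption: scale the MAD by 1.4826 so it estimates the noise standard deviation.
mads = np.median(np.abs(sigs - medians), axis=0) * 1.4826

# Robust z-score: after this, the noise level is about 1 on every channel.
normed = (sigs - medians) / mads

print(medians)
print(mads)
```

Because medians are barely affected by a few large spikes, this normalisation stays stable even when the chunk contains spikes, which is why a threshold such as `relative_threshold=6.5` can be expressed in units of noise.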


## Run the main loop: signal preprocessing + peak detection



In [7]:
t1 = time.perf_counter()
catalogueconstructor.run_signalprocessor(duration=60.)
t2 = time.perf_counter()

print('run_signalprocessor', t2-t1, 's')
print(catalogueconstructor)

run_signalprocessor 0.9844878860003519 s
CatalogueConstructor
  chan_grp 0 - ch0 ch1 ch2 ch3
  nb_peak_by_segment: 646, 677
  cluster_labels [-11]
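
The peak detection step can be pictured with the following sketch. It is a simplified stand-in, not the detector used by *tridesclous*: the function `detect_negative_peaks` is hypothetical, it assumes the signals are already z-scored, and it collapses the channels by taking the most negative value at each sample before looking for local minima below `-relative_threshold`.

```python
import numpy as np

def detect_negative_peaks(normed_sigs, relative_threshold=6.5):
    """Return sample indices of local minima below -relative_threshold."""
    # Collapse channels: most negative value across channels at each sample.
    trace = normed_sigs.min(axis=1)
    center = trace[1:-1]
    is_peak = (
        (center < -relative_threshold)
        & (center <= trace[:-2])   # not higher than the previous sample
        & (center < trace[2:])     # strictly lower than the next sample
    )
    # +1 because `center` starts at sample 1 of `trace`.
    return np.nonzero(is_peak)[0] + 1

rng = np.random.default_rng(1)
# Hypothetical z-scored noise, clipped so it never reaches the threshold.
normed_sigs = np.clip(rng.normal(size=(3000, 4)), -4, 4)
# Three hypothetical spikes on channel 2.
for pos in (500, 1500, 2500):
    normed_sigs[pos - 1:pos + 2, 2] = [-5., -12., -5.]

peaks = detect_negative_peaks(normed_sigs)
print(peaks)
```

With `peak_sign='-'` only negative deflections are considered, which is what this sketch mimics.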



## extract some waveforms

Take some waveforms from the signals. *n_left/n_right* must be chosen arbitrarily but long enough.
Better limits will be set later.

In [8]:
catalogueconstructor.extract_some_waveforms(n_left=-25, n_right=40, mode='rand', nb_max=10000, align_waveform=True)
print(catalogueconstructor)

compute_all_centroid 0.02803275299993402
CatalogueConstructor
  chan_grp 0 - ch0 ch1 ch2 ch3
  nb_peak_by_segment: 646, 677
  some_waveforms.shape: (1323, 65, 4)
  cluster_labels [0]
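
The role of *n_left/n_right* can be shown with a short sketch. The helper `cut_waveforms` below is hypothetical (it is not the *tridesclous* method): it simply stacks the window from `peak + n_left` to `peak + n_right` around each peak and skips peaks that are too close to the borders.

```python
import numpy as np

def cut_waveforms(sigs, peak_indexes, n_left=-25, n_right=40):
    """Stack the window [peak + n_left, peak + n_right) around each peak."""
    width = n_right - n_left
    # Keep only peaks for which the full window fits inside the signal.
    keep = (peak_indexes + n_left >= 0) & (peak_indexes + n_right <= sigs.shape[0])
    kept = peak_indexes[keep]
    waveforms = np.empty((kept.size, width, sigs.shape[1]), dtype=sigs.dtype)
    for i, peak in enumerate(kept):
        waveforms[i] = sigs[peak + n_left:peak + n_right]
    return waveforms

rng = np.random.default_rng(2)
sigs = rng.normal(size=(5000, 4)).astype('float32')   # hypothetical signals
peak_indexes = np.array([10, 300, 1200, 4990])        # first and last are too close to the borders

waveforms = cut_waveforms(sigs, peak_indexes)
print(waveforms.shape)
```

With `n_left=-25` and `n_right=40` each waveform is 65 samples long, which matches the second dimension of `some_waveforms.shape` above; the peak itself sits at index 25 of the window.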



# Clean waveforms

This tries to detect bad waveforms so that they are not included in the features and clustering.
Strange waveforms are tagged with -9 (alien).


In [9]:
catalogueconstructor.clean_waveforms(alien_value_threshold=100.)
print(catalogueconstructor)

compute_all_centroid 0.027273296000203118
CatalogueConstructor
  chan_grp 0 - ch0 ch1 ch2 ch3
  nb_peak_by_segment: 646, 677
  some_waveforms.shape: (1323, 65, 4)
  cluster_labels [0]
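
One plausible reading of `alien_value_threshold` is sketched below: a waveform is tagged as alien when its absolute value exceeds the threshold anywhere. This criterion is an assumption of the sketch, and `tag_aliens` is a hypothetical helper, not a *tridesclous* function.

```python
import numpy as np

ALIEN_LABEL = -9  # label used for alien waveforms, as described above

def tag_aliens(waveforms, labels, alien_value_threshold=100.):
    """Return a copy of labels where out-of-range waveforms get ALIEN_LABEL."""
    # Largest absolute value of each waveform over samples and channels.
    max_abs = np.abs(waveforms).max(axis=(1, 2))
    new_labels = labels.copy()
    new_labels[max_abs > alien_value_threshold] = ALIEN_LABEL
    return new_labels

rng = np.random.default_rng(5)
# Hypothetical z-scored waveforms: 5 waveforms x 65 samples x 4 channels.
waveforms = np.clip(rng.normal(size=(5, 65, 4)), -5, 5)
waveforms[3, 30, 1] = 250.          # one artefact far above the threshold
labels = np.zeros(5, dtype='int64')

new_labels = tag_aliens(waveforms, labels)
print(new_labels.tolist())  # [0, 0, 0, -9, 0]
```

In the run above no label -9 appears in `cluster_labels`, so no waveform of this small dataset was tagged.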



## Find good limits for waveforms and re-extract

To avoid keeping useless portions of signal on either side of the peaks, we take a shorter sweep.
This technique is based on the MAD: we keep only the central zone where the MAD is above the noise level.
The noise level is 1 (the signals are normalised). In practice we take a bit more: 1.1.

Here the method gives "good limits" of n_left=-10 and n_right=15.

So the waveforms become shorter.

Note that this technique works well on tetrodes or small channel counts, but for large arrays it is only as good as setting the limits manually.


In [10]:
n_left, n_right = catalogueconstructor.find_good_limits(mad_threshold = 1.1,)
print(catalogueconstructor)


compute_all_centroid 0.01800733700110868
CatalogueConstructor
  chan_grp 0 - ch0 ch1 ch2 ch3
  nb_peak_by_segment: 646, 677
  some_waveforms.shape: (1323, 24, 4)
  cluster_labels [0]
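
The idea of MAD-based limits can be sketched as follows. This is an illustration under stated assumptions, not the code of `find_good_limits`: the helper `find_limits` is hypothetical, the MAD is scaled by 1.4826, and a sample is kept when at least one channel has a MAD above the threshold. The synthetic waveforms have unit noise everywhere and extra variability only in a central zone.

```python
import numpy as np

def find_limits(waveforms, n_left, mad_threshold=1.1):
    """Return (new_n_left, new_n_right) bounding the zone where MAD > threshold."""
    med = np.median(waveforms, axis=0)
    # MAD across waveforms, for every (sample, channel) pair.
    mad = np.median(np.abs(waveforms - med), axis=0) * 1.4826
    # Samples where at least one channel is more variable than the noise.
    above = np.nonzero((mad > mad_threshold).any(axis=1))[0]
    if above.size == 0:
        return None
    # Convert window indices back to offsets relative to the peak.
    return int(n_left + above[0]), int(n_left + above[-1] + 1)

rng = np.random.default_rng(3)
n_left, n_right = -25, 40
# Hypothetical waveforms: pure unit noise ...
waveforms = rng.normal(size=(5000, n_right - n_left, 4))
# ... plus a per-waveform offset in the central zone (window indices 15 to 39).
waveforms[:, 15:40, :] += rng.normal(scale=3., size=(5000, 1, 1))

limits = find_limits(waveforms, n_left)
print(limits)
```

Outside the central zone the MAD stays close to 1, so a threshold slightly above 1 (here 1.1) separates the informative part of the sweep from pure noise.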



## Project to smaller space

To reduce the dimension of the waveforms (1323, 24, 4) we choose the global_pca method, which is appropriate for a tetrode.
It consists of flattening some_waveforms.shape (1323, 24, 4) to (1323, 24x4) and then applying a standard PCA on it with sklearn.

Let's keep 5 components.

In [11]:
t1 = time.perf_counter()
catalogueconstructor.extract_some_features(method='global_pca', n_components=5)
t2 = time.perf_counter()
print('project', t2-t1)
print(catalogueconstructor)

project 0.07931107900003553
CatalogueConstructor
  chan_grp 0 - ch0 ch1 ch2 ch3
  nb_peak_by_segment: 646, 677
  some_waveforms.shape: (1323, 24, 4)
  some_features.shape: (1323, 5)
  cluster_labels [0]
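
The flatten-then-PCA idea can be sketched with NumPy alone. Note the swap: a plain SVD of the centered data stands in here for sklearn's PCA, and the waveforms are random placeholders, so this only illustrates the shapes and the projection, not the actual features computed by *tridesclous*.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical waveforms with the same shape as some_waveforms above.
waveforms = rng.normal(size=(1323, 24, 4))

# Flatten samples and channels together: (1323, 24, 4) -> (1323, 96).
flat = waveforms.reshape(waveforms.shape[0], -1)

# PCA through SVD: center the data, then project on the first right singular vectors.
centered = flat - flat.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

n_components = 5
features = centered @ vt[:n_components].T

print(flat.shape, features.shape)
```

Flattening all channels together lets one component capture how a spike looks across the 4 wires of the tetrode at once, which is why this global approach suits a small channel count.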



# find clusters

There are many options to cluster these features. Here is a simple one: the well-known kmeans method.

Unfortunately we need to choose the number of clusters. Too bad... Let's take 12.

Later on we will be able to refine this manually.

In [12]:
t1 = time.perf_counter()
catalogueconstructor.find_clusters(method='kmeans', n_clusters=12)
t2 = time.perf_counter()
print('find_clusters', t2-t1)
print(catalogueconstructor)


compute_all_centroid 0.028010052999889012
order_clusters waveforms_rms
find_clusters 0.3653124420015956
CatalogueConstructor
  chan_grp 0 - ch0 ch1 ch2 ch3
  nb_peak_by_segment: 646, 677
  some_waveforms.shape: (1323, 24, 4)
  some_features.shape: (1323, 5)
  cluster_labels [ 0  1  2  3  4  5  6  7  8  9 10 11]
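
For readers unfamiliar with kmeans, here is a bare-bones version of the algorithm (Lloyd iterations) in NumPy. It is only a teaching sketch: the function `kmeans`, its deterministic farthest-point initialisation and the toy 2D features are all assumptions of this example, and `find_clusters(method='kmeans')` may well rely on a library implementation instead.

```python
import numpy as np

def kmeans(features, n_clusters, n_iter=20):
    """Very small kmeans: returns (labels, centers)."""
    # Deterministic initialisation: start from the first point, then repeatedly
    # add the point farthest from the centers chosen so far.
    centers = [features[0]]
    for _ in range(n_clusters - 1):
        d = np.min([np.sum((features - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(features[np.argmax(d)])
    centers = np.array(centers, dtype='float64')

    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        dist = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each center moves to the mean of its points.
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(6)
# Hypothetical features: three well-separated blobs of 100 points each.
blob_centers = np.array([[0., 0.], [10., 0.], [0., 10.]])
features = np.concatenate([c + rng.normal(scale=0.5, size=(100, 2)) for c in blob_centers])

labels, centers = kmeans(features, n_clusters=3)
print(np.bincount(labels))
```

The number of clusters has to be given up front, which is exactly the drawback mentioned above; choosing it generously and merging afterwards in the GUI is one way around it.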



## Open CatalogueWindow for visual check

This opens a CatalogueWindow, where we can check, split, merge, trash and play until we are happy.

Once we are happy, we can save the catalogue.

Don't save anything here.

In [13]:
%gui qt5
import pyqtgraph as pg
app = pg.mkQApp()
win = tdc.CatalogueWindow(catalogueconstructor)
win.show()
app.exec_()    

make_catalogue 0.02911864300040179


0

Here is a snapshot of CatalogueWindow

<img src="../doc/img/snapshot_cataloguewindow.png">


# Dirty clean of the catalogue

Here is a quick and dirty clean of the catalogue, and then we save it!




In [14]:
#order cluster by waveforms rms
catalogueconstructor.order_clusters(by='waveforms_rms')

#put label 0 to trash
mask = catalogueconstructor.all_peaks['cluster_label'] == 0
catalogueconstructor.all_peaks['cluster_label'][mask] = -1
catalogueconstructor.on_new_cluster()

#save the catalogue
catalogueconstructor.make_catalogue_for_peeler()

order_clusters waveforms_rms
make_catalogue 0.02769362500112038
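
The relabelling in the cell above relies on how NumPy structured arrays behave. The sketch below uses a small stand-in array (the `index` field and the values are hypothetical; only the `cluster_label` field name comes from the cell) to show why the field is selected first and the mask applied second.

```python
import numpy as np

# Hypothetical stand-in for catalogueconstructor.all_peaks.
all_peaks = np.zeros(6, dtype=[('index', 'int64'), ('cluster_label', 'int64')])
all_peaks['cluster_label'] = [0, 1, 2, 0, 1, 0]

mask = all_peaks['cluster_label'] == 0

# Pitfall: boolean indexing first returns a copy, so this changes nothing.
all_peaks[mask]['cluster_label'] = -1
unchanged = all_peaks['cluster_label'].tolist()
print(unchanged)  # [0, 1, 2, 0, 1, 0]

# Selecting the field first gives a view, so the assignment is done in place.
all_peaks['cluster_label'][mask] = -1
relabelled = all_peaks['cluster_label'].tolist()
print(relabelled)  # [-1, 1, 2, -1, 1, -1]
```

After ordering by waveform RMS, the cell sends label 0 to the trash label -1, using the field-first form shown here.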


# Peeler

Create and run the Peeler.
It should be pretty fast: in the run below the computation takes about 1.76 s for 2 x 28.8 s of signal, a speed-up of about 33 over real time.


In [15]:
initial_catalogue = dataio.load_catalogue(chan_grp=0)

peeler = Peeler(dataio)
peeler.change_params(catalogue=initial_catalogue)

t1 = time.perf_counter()
peeler.run()
t2 = time.perf_counter()
print('peeler.run', t2-t1)

print()
for seg_num in range(dataio.nb_segment):
    spikes = dataio.get_spikes(seg_num)
    print('seg_num', seg_num, 'nb_spikes', spikes.size)
    


100%|██████████| 421/421 [00:00<00:00, 525.91it/s]
100%|██████████| 421/421 [00:00<00:00, 517.98it/s]

peeler.run 1.7563410889997613

seg_num 0 nb_spikes 611
seg_num 1 nb_spikes 648
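
The principle of peeling (find a peak, pick the best matching template, subtract it, repeat) can be illustrated on a toy single-channel trace. This sketch is deliberately simplified and is not the *tridesclous* Peeler: the function `peel`, the triangular templates and the low-noise trace are hypothetical, and the intersample jitter estimation is left out entirely.

```python
import numpy as np

def peel(sig, templates, peak_offset, threshold=6.5, max_iter=100):
    """Greedy peeling on one channel: returns (sorted (index, label) list, residual)."""
    residual = sig.copy()
    width = templates.shape[1]
    spikes = []
    for _ in range(max_iter):
        peak = int(np.argmin(residual))
        if residual[peak] > -threshold:
            break  # nothing left below the threshold
        start = peak - peak_offset
        if start < 0 or start + width > residual.size:
            break  # sketch only: stop instead of handling border peaks
        window = residual[start:start + width]
        # Best template = smallest squared error against the aligned window.
        label = int(np.argmin(((templates - window) ** 2).sum(axis=1)))
        residual[start:start + width] -= templates[label]
        spikes.append((peak, label))
    return sorted(spikes), residual

# Two hypothetical templates: same triangular shape, amplitudes 10 and 20.
peak_offset, width = 10, 24
shape = -np.maximum(0., 1. - np.abs(np.arange(width) - peak_offset) / 4.)
templates = np.stack([10. * shape, 20. * shape])

rng = np.random.default_rng(7)
sig = rng.normal(scale=0.2, size=600)        # low-noise toy trace
truth = [(100, 0), (300, 1), (500, 0)]
for pos, label in truth:
    sig[pos - peak_offset:pos - peak_offset + width] += templates[label]

spikes, residual = peel(sig, templates, peak_offset)
print(spikes)
```

Subtracting each matched template from the signal is what allows a second, smaller spike hidden under a larger one to show up at the next iteration.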





## Open PeelerWindow for visual checking

In [16]:
%gui qt5
import pyqtgraph as pg
app = pg.mkQApp()
win = tdc.PeelerWindow(dataio=dataio, catalogue=initial_catalogue)
win.show()
app.exec_()


-1

Here is a snapshot of PeelerWindow

<img src="../doc/img/snapshot_peelerwindow.png">