# Preprocessing pipeline

## Outline

<img src="static/preprocessing_pipeline_diagram.svg">

1. __Temporal filtering__

High-frequency artefacts and slow drifts are removed with a zero-phase bandpass filter 
using MNE-Python [1]. 

2. __Segmenting the data__

Epochs are non-overlapping data segments of a given duration, created from the 
continuous data.
Epochs can be created (1) from events, using a custom method that builds epochs 
from the annotations in the raw data, or (2) without events, in which case data 
segments are created starting from the beginning of the raw data. 

3. __Outlier data rejection__  

- _Preliminary rejection_

Epochs are rejected based on a global threshold on the z-score (> 3) of the epoch 
variance and amplitude range.

- _ICA decomposition_  

The default method is the infomax algorithm; it can be changed in the 
configuration file, along with the number of components and the decimation parameter. 
Components containing blink artefacts are automatically marked with MNE-Python.
The ICA sources can be visualized, and components can be interactively selected and 
rejected based on their topographies, time courses or frequency spectra.

- _Autoreject_  

Autoreject [2, 3] uses unsupervised learning to estimate the rejection threshold for 
the epochs. To reduce computation time, which increases with the number of 
segments and channels, autoreject can be fitted on a representative subset of epochs 
(25% of total epochs). Once the parameters are learned, the solution can be applied to 
any data that contains the channels used during the fit.

4. __Outlier channel interpolation__

The Random Sample Consensus (RANSAC) algorithm [4] selects a random subsample of good 
channels to predict each channel in short, non-overlapping 4-second time windows. 
Bad sensors are then interpolated with spherical splines (Perrin et al., 1989).
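
As an illustration of the preliminary rejection in step 3, the z-score thresholding can be sketched with NumPy alone. This is a minimal sketch and not the `meeg_tools` implementation: the function name `find_outlier_epochs`, the averaging of each metric across channels, and the synthetic data are assumptions made for the example.

```python
import numpy as np


def find_outlier_epochs(data, threshold=3.0):
    """Return indices of epochs whose variance or amplitude range is an outlier.

    data: array of shape (n_epochs, n_channels, n_times).
    The per-channel metrics are averaged across channels (an assumption).
    """
    variance = data.var(axis=2).mean(axis=1)
    amplitude_range = np.ptp(data, axis=2).mean(axis=1)

    bad = np.zeros(len(data), dtype=bool)
    for metric in (variance, amplitude_range):
        z_scores = (metric - metric.mean()) / metric.std()
        bad |= z_scores > threshold
    return np.flatnonzero(bad)


# Synthetic example: 50 epochs of noise, one of them with inflated amplitude
rng = np.random.default_rng(0)
data = rng.standard_normal((50, 4, 100))
data[7] *= 20

bad_epochs = find_outlier_epochs(data)
print(bad_epochs)
```

Because both metrics are standardized across epochs, a single threshold can be shared by all recordings regardless of their absolute amplitude.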


#### References

[1] A. Gramfort, M. Luessi, E. Larson, D. Engemann, D. Strohmeier, C. Brodbeck, R. Goj, M. Jas, T. Brooks, L. Parkkonen, M. Hämäläinen, MEG and EEG data analysis with MNE-Python, Frontiers in Neuroscience, Volume 7, 2013, ISSN 1662-453X

[2] Mainak Jas, Denis Engemann, Federico Raimondo, Yousra Bekhti, and Alexandre Gramfort, “Automated rejection and repair of bad trials in MEG/EEG.” In 6th International Workshop on Pattern Recognition in Neuroimaging (PRNI), 2016.

[3] Mainak Jas, Denis Engemann, Yousra Bekhti, Federico Raimondo, and Alexandre Gramfort. 2017. “Autoreject: Automated artifact rejection for MEG and EEG data”. NeuroImage, 159, 417-429.

[4] Bigdely-Shamlo, N., Mullen, T., Kothe, C., Su, K. M., & Robbins, K. A. (2015). The PREP pipeline: standardized preprocessing for large-scale EEG analysis. Frontiers in neuroinformatics, 9, 16.



## Import packages


```%matplotlib qt``` is the recommended backend for interactive visualization, although it can be slower.    

Switch to ```%matplotlib inline``` for faster but static plots.

In [None]:
import os
from pathlib import Path

from ipyfilechooser import FileChooser
import pandas as pd

from meeg_tools.preprocessing import *
from meeg_tools.utils.epochs import create_epochs_from_intervals
from meeg_tools.utils.raw import read_raw_measurement, filter_raw, concat_raws_with_suffix
from meeg_tools.utils.log import update_log

%matplotlib qt

## Load raw data

See the [MNE-Python documentation on reading EEG data](https://mne.tools/stable/auto_tutorials/io/20_reading_eeg_data.html) for help with supported file formats.  


In [None]:
# Use the widget to navigate to the experiment folder path and select an EEG file 
base_path = '/Users/weian/Downloads/Raw_data/'
fc = FileChooser(base_path)
fc.filter_pattern = ['*.vhdr', '*.edf']

display(fc)

In [None]:
# Load the selected file (use this when the data was recorded in one piece, i.e. there is only one recording in the folder)
raw = read_raw_measurement(raw_file_path=fc.selected)

## Concatenate raw data

Use `concat_raws_with_suffix` when an issue during the recording left multiple EEG files for a single measurement.


In [None]:
# note that we choose a folder and NOT a file name as before
#raws_folder_path = '/Users/weian/Downloads/EEG-3/'

# with the suffix argument we specify what kind of files to look for
#raw = concat_raws_with_suffix(path_to_raw_files=raws_folder_path, suffix='.vhdr')

In [None]:
#raw.copy().crop(tmin=600, tmax=1200).plot()

## Select condition

The current logic for saving the preprocessed files is to create subfolders inside `base_path`,
with the name "preprocessed" and the name of the condition (e.g. "epochs_asrt", "epochs_rs").

In [None]:
condition = 'epochs_rs'


# Path to the folder for preprocessed and interim files
folder_name = 'preprocessed'
epochs_path = os.path.join(base_path, folder_name, condition)


# Create the folder if it does not exist yet
os.makedirs(epochs_path, exist_ok=True)

print(epochs_path)

## Temporal filtering

We apply a bandpass filter on the continuous data using the `filter_raw` function.

The default parameters can be checked with `settings['bandpass_filter']`.

In [None]:
settings['bandpass_filter']

In [None]:
raw_bandpass = filter_raw(raw=raw)

## Create epochs
### B. Create epochs with a fixed duration
- not relative to stimulus onset

The segment length is controlled by the `settings['epochs']['duration']` setting.

B. 1. Epochs are created based on a stimulus interval

In [None]:
settings['epochs']['duration'] = 2

In [None]:
epochs = create_epochs_from_intervals(raw_bandpass, [(83, 84), (87, 88)])

In [None]:
epochs
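
To make the idea of fixed-duration segments concrete, the sketch below computes the onsets of non-overlapping segments that fit inside one interval. It is only an illustration: the helper `fixed_length_onsets` is hypothetical, it works on times in seconds, and it does not reproduce how `create_epochs_from_intervals` resolves intervals internally.

```python
def fixed_length_onsets(start, stop, duration):
    """Onsets of non-overlapping segments that fit entirely in [start, stop].

    start, stop and duration are expressed in the same unit (seconds here).
    A trailing remainder shorter than `duration` is discarded.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    n_segments = int((stop - start) // duration)
    return [start + i * duration for i in range(n_segments)]


# A 7-second interval holds three full 2-second segments
onsets = fixed_length_onsets(start=10.0, stop=17.0, duration=2.0)
print(onsets)
```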

## Run preprocessing


### 1.1. Preliminary epoch rejection

In [None]:
epochs_faster = prepare_epochs_for_ica(epochs=epochs)

### 1.2. Run ICA


When visualizing the components, it is recommended to subset the data (see below).


Picard can be used to solve the same problems as FastICA, Infomax, and extended Infomax, but typically converges faster than either of those methods. To make use of Picard’s speed while still obtaining the same solution as with other algorithms, you need to specify `method='picard'` and `fit_params` as a dictionary with the following combination of keys:

`dict(ortho=False, extended=False)` for Infomax  

`dict(ortho=False, extended=True)` for extended Infomax  

`dict(ortho=True, extended=True)` for FastICA
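
The three combinations above can be kept in a small lookup so that the intended algorithm is named explicitly in the notebook. This is a convenience sketch: the helper `picard_fit_params` and the key names are hypothetical and not part of `meeg_tools` or MNE-Python.

```python
# Picard settings that reproduce the solutions of other ICA algorithms,
# as listed in the text above. The dictionary keys are illustrative names.
PICARD_FIT_PARAMS = {
    "infomax": dict(ortho=False, extended=False),
    "extended_infomax": dict(ortho=False, extended=True),
    "fastica": dict(ortho=True, extended=True),
}


def picard_fit_params(algorithm):
    """Return a copy of the fit_params that emulate `algorithm` with Picard."""
    try:
        return dict(PICARD_FIT_PARAMS[algorithm])
    except KeyError:
        valid = ", ".join(sorted(PICARD_FIT_PARAMS))
        raise ValueError(f"Unknown algorithm {algorithm!r}; choose from: {valid}") from None


fit_params = picard_fit_params("extended_infomax")
print(fit_params)
```

The resulting dictionary can be passed on as `run_ica(epochs_faster, fit_params=fit_params)`, provided that `settings["ica"]` selects Picard as the method.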


In [None]:
settings["ica"]

In [None]:
ica = run_ica(epochs_faster, fit_params=dict(ortho=False, extended=True))

In [None]:
# Visualize components on epochs
# Subset epochs to reduce execution time (e.g. take epochs from every 7th event)
#subset = list(epochs.event_id.keys())[::7]
# Exclude components by selecting them; right-click on a component name to visualize its source:
ica.plot_sources(epochs_faster)

In [None]:
# Plot component topographies
ica.plot_components()


In [None]:
ica.exclude

In [None]:
# After selecting the components to exclude, apply ICA to epochs
epochs_ica = apply_ica(epochs_faster, ica)

In [None]:
print(epochs_ica.info)

### 1.4. Save cleaned epochs (recommended)

In [None]:
os.path.join(epochs_path, f'{epochs_ica.info["temp"]}-epo.fif.gz')

In [None]:
epochs_ica.save(os.path.join(epochs_path, f'{epochs_ica.info["temp"]}-epo.fif.gz'),
                overwrite=True)

### 1.5. Create a log file 

We can create a log file for the preprocessed data and store metadata
that could be useful to remember. You can add more columns to this, or 
remove the ones that are not needed. For documentation purposes, it is 
recommended to store the number of rejected and total epochs, the number of
ICA components that were rejected, the number of interpolated electrodes, etc.
You can also add a column with "notes" to add custom descriptions about the data.
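
For orientation, one log entry with the recommended fields could look like the sketch below. It uses only the standard library and an in-memory buffer; the column names and values are hypothetical and do not reflect the file that `update_log` actually writes.

```python
import csv
import io

# Hypothetical column names covering the metadata recommended above
FIELDS = [
    "file",
    "n_epochs_total",
    "n_epochs_rejected",
    "n_ica_components_rejected",
    "n_channels_interpolated",
    "notes",
]


def write_log(rows, stream):
    """Write log rows (a list of dicts keyed by FIELDS) as CSV to `stream`."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up example values, written to an in-memory buffer instead of a file
stream = io.StringIO()
write_log(
    [
        {
            "file": "subject_01",
            "n_epochs_total": 120,
            "n_epochs_rejected": 9,
            "n_ica_components_rejected": 2,
            "n_channels_interpolated": 1,
            "notes": "noisy first minute",
        }
    ],
    stream,
)
log_text = stream.getvalue()
print(log_text)
```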

In [None]:
notes = ''

In [None]:
settings["log"] = "Your name"

In [None]:
update_log(epochs_path, epochs_ica, notes)

### 2.1. Run autoreject

In [None]:
reject_log = run_autoreject(epochs_ica, subset=False)

In [None]:
reject_log.report

In [None]:
# Here you can decide how strict the epoch rejection should be.
# You can drop only the epochs that were marked as bad, or apply a
# stricter threshold and also drop epochs where more than 15% of the
# channels were marked as noisy.

# You can plot the epochs with Autoreject, where bad epochs are marked
# in red.

#reject_log.plot_epochs(epochs_faster)
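
The stricter criterion mentioned in the comments (more than 15% noisy channels) can be sketched on a label matrix of shape `(n_epochs, n_channels)`. The sketch assumes a layout like autoreject's `RejectLog.labels`, where 0 means good and non-zero values mean bad or interpolated; the helper name and the synthetic labels are made up for the example, and counting interpolated channels as noisy is a design choice.

```python
import numpy as np


def epochs_over_noisy_fraction(labels, max_fraction=0.15):
    """Indices of epochs where more than `max_fraction` of channels are noisy.

    labels: (n_epochs, n_channels) array, 0 = good, non-zero = noisy.
    NaN entries (channels that were not evaluated) are counted as good.
    """
    labels = np.nan_to_num(np.asarray(labels, dtype=float))
    noisy_fraction = (labels > 0).mean(axis=1)
    return np.flatnonzero(noisy_fraction > max_fraction)


# Synthetic labels: 5 epochs, 20 channels
labels = np.zeros((5, 20))
labels[1, :2] = 1  # 10% of channels noisy: kept
labels[3, :4] = 1  # 20% of channels noisy: dropped

strict_bad = epochs_over_noisy_fraction(labels)
print(strict_bad)
```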

In [None]:
epochs_autoreject = apply_autoreject(epochs=epochs_ica, reject_log=reject_log)

In [None]:
os.path.join(epochs_path, f'{epochs_autoreject.info["temp"]}-epo.fif.gz')

In [None]:
epochs_autoreject.save(os.path.join(epochs_path, f'{epochs_autoreject.info["temp"]}-epo.fif.gz'), overwrite=True)

In [None]:
# Update log
notes = ''

update_log(epochs_path, epochs_autoreject, notes)

### 3. Find and interpolate bad channels

In [None]:
bads = get_noisy_channels(epochs=epochs_autoreject, with_ransac=True)

In [None]:
#bads.extend(['T7', 'CPz'])

# .append() for string e.g. 'F7'
# .extend() for list ['F7', 'F8']

In [None]:
bads

In [None]:
epochs_ransac = interpolate_bad_channels(epochs=epochs_autoreject, bads=bads)

In [None]:
print(epochs_ransac.info)

## 4. Final visual inspection

Mark epochs that should be dropped, etc.

In [None]:
# Use indexing to plot fewer epochs (faster), e.g. [::7] shows only every 7th epoch
epochs_ransac.plot(n_epochs=10,
                   n_channels=32,
                   # group_by='position',
                   scalings={'eeg': 20e-6})

### 5.2. Set average reference

To set a “virtual reference” that is the average of all channels, you can use `set_eeg_reference()` with `ref_channels='average'`.


In [None]:
epochs_ransac.set_eeg_reference('average')
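
Conceptually, an average reference subtracts the mean across channels from every channel at each time point. The NumPy sketch below shows that arithmetic on a toy array; it is not how MNE-Python implements `set_eeg_reference()`, and the function name `average_reference` is made up for the example.

```python
import numpy as np


def average_reference(data):
    """Re-reference data of shape (n_channels, n_times) to the channel average."""
    return data - data.mean(axis=0, keepdims=True)


# Toy data: 3 channels, 2 time points
data = np.array([[1.0, 2.0],
                 [3.0, 6.0],
                 [5.0, 10.0]])

referenced = average_reference(data)
print(referenced)
```

After re-referencing, the channels sum to zero at every time point, which is a quick sanity check on real data as well.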

## 6. Save cleaned epochs

In [None]:
os.path.join(epochs_path, f'{epochs_ransac.info["temp"]}-epo.fif.gz')

In [None]:
epochs_ransac.save(os.path.join(epochs_path, f'{epochs_ransac.info["temp"]}-epo.fif.gz'), overwrite=True)

In [None]:
update_log(epochs_path, epochs_ransac, '')