# MEG data loading function examples

This notebook demonstrates the use of the data loading function for MEG data (`load_MEG_dataset`).

In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('../code')
from load_data import load_MEG_dataset



### Add code directory to path

This is necessary because the code for the data loading function is not in the same directory as the notebook.

In [2]:
import sys
sys.path.append('../code')

### Load the function

The function `load_MEG_dataset` downloads, loads, and preprocesses the data.

In [3]:
from load_data import load_MEG_dataset

In [4]:
?load_MEG_dataset

Signature:
load_MEG_dataset(
    subject_ids: List[str],
    mode: str = 'individual',
    output_format: str = 'numpy',
    trial_data_format: str = '2D',
    data_location: str = './data/',
    center_timepoint: int = 20,
    window_width: List[int] = [-5, 6],
    shuffle…

### Basic usage

The function can load data for any number of subjects, specified as a list of subject IDs in the `subject_ids` argument.

In [5]:
# Single subject
X, y = load_MEG_dataset(['001'])

# Two subjects
X, y = load_MEG_dataset(['001', '002'])

Loading subject 001
Data loaded
Subject 001 complete
--------------------------------------
Loading subject 001
Data loaded
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Subject 002 complete
--------------------------------------


### Data format

The function can return the data as either a NumPy array (more useful for e.g. scikit-learn) or a batched TensorFlow dataset (more useful for neural networks).

The data format is specified using the `output_format` parameter.

If `output_format` is `'numpy'`, the function returns a tuple of two NumPy arrays: `(data, labels)`.

In [7]:
X, y = load_MEG_dataset(['001', '002'], mode='concatenate', output_format='numpy')
print(X.shape)  # (1350, 272, 11)
print(y.shape)  # (1350, )

Loading subject 001
Data loaded
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Subject 002 complete
--------------------------------------
(1350, 272, 11)
(1350,)


If `output_format` is `'tf'`, the function returns a batched TensorFlow dataset whose elements are dictionaries with `'image'` (data) and `'label'` keys.

In [8]:
ds = load_MEG_dataset(['001', '002'], mode='concatenate', output_format='tf')
ds

Loading subject 001
Data loaded
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Subject 002 complete
--------------------------------------




<BatchDataset element_spec={'image': TensorSpec(shape=(None, 272, 11), dtype=tf.float64, name=None), 'label': TensorSpec(shape=(None,), dtype=tf.int32, name=None)}>

### Dealing with multiple subjects

Data for multiple subjects can be returned in three different ways, depending on the `mode` parameter.

1.  `mode='individual'`: The function returns a list of datasets, one for each subject.
2.  `mode='concatenate'`: The function returns a single dataset, where the data and labels are concatenated across subjects.
3.  `mode='stack'`: The function returns a single dataset, where the data and labels are stacked across subjects in an additional dimension.
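The difference between the three modes can be pictured with plain NumPy. The following is only a sketch on synthetic arrays with the same shapes as the examples in this notebook; it does not call `load_MEG_dataset`, and the variable names are made up for illustration.

```python
import numpy as np

# Synthetic stand-ins for two subjects: (trials, channels, samples) and (trials,)
rng = np.random.default_rng(0)
X_per_subject = [rng.normal(size=(675, 272, 11)) for _ in range(2)]
y_per_subject = [rng.integers(0, 2, size=675) for _ in range(2)]

# 'individual': keep one array per subject in a list
X_individual, y_individual = X_per_subject, y_per_subject

# 'concatenate': join subjects along the existing trial axis
X_concat = np.concatenate(X_per_subject, axis=0)
y_concat = np.concatenate(y_per_subject, axis=0)

# 'stack': add a new leading axis that indexes subjects
X_stack = np.stack(X_per_subject, axis=0)
y_stack = np.stack(y_per_subject, axis=0)

print(len(X_individual), X_individual[0].shape)
print(X_concat.shape, y_concat.shape)
print(X_stack.shape, y_stack.shape)
```

Note that stacking only works when every subject has the same number of trials, whereas concatenation does not need this.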

In [9]:
# Individual - a list of numpy arrays
X, y = load_MEG_dataset(['001', '002'], mode='individual', output_format='numpy')
print(len(X), len(y))  # Lists with two elements
print(X[0].shape, y[0].shape)  # (675, 272, 11) (675,)

# Concatenate - a single numpy array
X, y = load_MEG_dataset(['001', '002'], mode='concatenate', output_format='numpy')
print(X.shape, y.shape)  # (1350, 272, 11) (1350,)

# Stack - a single numpy array, with subjects represented in the first dimension
X, y = load_MEG_dataset(['001', '002'], mode='stack', output_format='numpy')
print(X.shape, y.shape)  # (2, 675, 272, 11) (2, 675)

Loading subject 001
Data loaded
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Subject 002 complete
--------------------------------------
2 2
(675, 272, 11) (675,)
Loading subject 001
Data loaded
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Subject 002 complete
--------------------------------------
(1350, 272, 11) (1350,)
Loading subject 001
Data loaded
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Subject 002 complete
--------------------------------------
(2, 675, 272, 11) (2, 675)


### Trial data format

The data for each trial can be returned as either a flattened 1D array or a 2D array with shape `(n_channels, n_samples)`. This is set using the `trial_data_format` parameter.
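The relationship between the two formats is a reshape: 272 channels by 11 samples gives 2992 values per trial. The sketch below assumes row-major (C-order) flattening, which may or may not match the order used inside `load_MEG_dataset`.

```python
import numpy as np

# Four synthetic trials in 2D format: (trials, channels, samples)
X_2d = np.arange(4 * 272 * 11, dtype=float).reshape(4, 272, 11)

# Flatten each trial into a single vector of channels * samples values
X_1d = X_2d.reshape(X_2d.shape[0], -1)
print(X_1d.shape)

# The 2D layout can be recovered as long as the same ordering is used
X_back = X_1d.reshape(-1, 272, 11)
print(np.array_equal(X_back, X_2d))
```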

In [10]:
# 1D format
X, y = load_MEG_dataset(['001', '002'], mode='concatenate', output_format='numpy', trial_data_format='1D')
print(X.shape, y.shape)  # (1350, 2992) (1350,)

# 2D format
X, y = load_MEG_dataset(['001', '002'], mode='concatenate', output_format='numpy', trial_data_format='2D')
print(X.shape, y.shape)  # (1350, 272, 11) (1350,)

Loading subject 001
Data loaded
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Subject 002 complete
--------------------------------------
(1350, 2992) (1350,)
Loading subject 001
Data loaded
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Subject 002 complete
--------------------------------------
(1350, 272, 11) (1350,)


### Principal component analysis (PCA)

The function can also run PCA on the data, controlled by the `pca_n_components` parameter. If this is set to `None` (the default), no PCA is performed. Otherwise, PCA is performed with the specified number of components.
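In the example output in this section, the channel dimension shrinks from 272 to 30 while the number of samples stays at 11. One way to get that behaviour is sketched below with NumPy's SVD. This is an assumption about the approach, not the code used by `load_MEG_dataset`, and `pca_over_channels` is a hypothetical helper.

```python
import numpy as np

def pca_over_channels(X, n_components):
    """Project the channel axis of X (trials, channels, samples) onto principal components."""
    n_trials, n_channels, n_samples = X.shape
    # Treat every (trial, sample) pair as one observation of the channel vector
    obs = X.transpose(0, 2, 1).reshape(-1, n_channels)
    obs = obs - obs.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance
    _, _, vt = np.linalg.svd(obs, full_matrices=False)
    projected = obs @ vt[:n_components].T
    # Restore the (trials, components, samples) layout
    return projected.reshape(n_trials, n_samples, n_components).transpose(0, 2, 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 272, 11))  # synthetic data, not real MEG
X_reduced = pca_over_channels(X, n_components=30)
print(X_reduced.shape)
```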

In [11]:
# PCA with 30 components
X, y = load_MEG_dataset(['001', '002'], mode='concatenate', output_format='numpy', trial_data_format='2D', pca_n_components=30)
print(X.shape)  # (1350, 30, 11)

Loading subject 001
Data loaded
Running PCA
PCA complete
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Running PCA
PCA complete
Subject 002 complete
--------------------------------------
(1350, 30, 11)


### Setting the timepoints to use

The center of the data window is set using the `center_timepoint` parameter. This is the sample index **after stimulus onset** that will be the center of the data window. The default is 20 (i.e., 200 ms after stimulus onset).

The width of the window is set using the `window_width` parameter, given as a pair of offsets relative to the center timepoint: (`samples_before`, `samples_after`). `samples_before` is a negative number, and `samples_after` must be 1 greater than the number of samples wanted after the center, because it also counts the center sample. The default is `[-5, 6]`, i.e. 5 samples before and 5 samples after the center timepoint (11 samples in total).
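The windowing described above can be sketched as a slice along the time axis. The helper `extract_window` and the `onset_index` argument are hypothetical, and treating the second offset as an exclusive slice bound is an assumption that happens to reproduce the window lengths shown in this notebook.

```python
import numpy as np

def extract_window(epochs, onset_index, center_timepoint=20, window_width=(-5, 6)):
    """Cut a window around a center sample from epochs shaped (..., n_times)."""
    center = onset_index + center_timepoint
    start = center + window_width[0]  # negative offset: samples before the center
    stop = center + window_width[1]   # exclusive bound, hence the "+1"
    return epochs[..., start:stop]

# Synthetic epochs whose values equal their own time index
epochs = np.broadcast_to(np.arange(100.0), (3, 272, 100))
onset_index = 30  # hypothetical position of stimulus onset within each epoch

default_window = extract_window(epochs, onset_index)
wide_window = extract_window(epochs, onset_index, window_width=(-10, 11))
print(default_window.shape, wide_window.shape)
print(default_window[0, 0])  # time indices covered by the default window
```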

In [12]:
# With a window of 10 samples before and after
X, y = load_MEG_dataset(['001', '002'], mode='concatenate', output_format='numpy', trial_data_format='2D', pca_n_components=30, window_width=(-10, 11))
print(X.shape)  # (1350, 30, 21)

Loading subject 001
Data loaded
Running PCA
PCA complete
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Running PCA
PCA complete
Subject 002 complete
--------------------------------------
(1350, 30, 21)


### Shuffling the data

The function can also shuffle the trials. This can be done using the `shuffle` parameter.
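When shuffling trials, the data and labels have to be reordered with the same permutation so that each trial keeps its label. A minimal NumPy sketch of that idea follows; `shuffle_trials` is a hypothetical helper, not part of `load_data`.

```python
import numpy as np

def shuffle_trials(X, y, seed=0):
    """Shuffle trials and labels together using one shared permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    return X[order], y[order]

# Synthetic data where trial i holds the value i and has label i
X = np.arange(10).reshape(10, 1)
y = np.arange(10)

X_shuffled, y_shuffled = shuffle_trials(X, y, seed=42)
print(np.array_equal(X_shuffled[:, 0], y_shuffled))  # data and labels stay aligned
```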

In [13]:
X, y = load_MEG_dataset(['001', '002'], mode='concatenate', output_format='numpy', trial_data_format='2D', shuffle=True)

Loading subject 001
Data loaded
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Subject 002 complete
--------------------------------------


AttributeError: 'builtin_function_or_method' object has no attribute 'shuffle'

### Training and test sets

The function can split the data into training and test sets. Their relative size is set using the `train_test_split` parameter, which specifies the proportion of data to be used for training (by default, this is set to 0.75).

Each call returns either the training set or the test set, depending on the `training` parameter. If `training` is `True`, the function returns the training set. If `training` is `False`, it returns the test set. Because the split is seeded by the `seed` parameter, repeated calls return the same training and test sets unless `seed` is changed.
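The reason a fixed seed gives matching training and test sets across separate calls can be seen in a small sketch. `split_trials` is a hypothetical helper, and the rounding of the split point is an assumption; the real function may partition the trials differently.

```python
import numpy as np

def split_trials(X, y, train_test_split=0.75, training=True, seed=0):
    """Return the training or test part of a seeded random partition of the trials."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))           # same order for the same seed
    n_train = int(len(X) * train_test_split)  # assumed rounding: truncate
    idx = order[:n_train] if training else order[n_train:]
    return X[idx], y[idx]

# Synthetic data: trial i holds the value i
X = np.arange(1350)[:, None]
y = np.arange(1350) % 2

train_X, train_y = split_trials(X, y, training=True)
test_X, test_y = split_trials(X, y, training=False)
print(train_X.shape, test_X.shape)

# Both calls use the same permutation, so the two sets never overlap
print(set(train_X[:, 0]) & set(test_X[:, 0]))
```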

In [None]:
train_X, train_y = load_MEG_dataset(['001', '002'], mode='concatenate', output_format='numpy', training=True)
test_X, test_y = load_MEG_dataset(['001', '002'], mode='concatenate', output_format='numpy', training=False)

print(train_X.shape, train_y.shape)  # training split (75% of trials by default)
print(test_X.shape, test_y.shape)  # test split (the remaining trials)

### Batching

If the data is returned as a TensorFlow dataset, the `batch_size` parameter sets the size of each batch. Batching is done across trials, so each batch contains a subset of trials.
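Batching across trials can be illustrated without TensorFlow. The generator below is a plain NumPy sketch (`iter_batches` is a hypothetical helper) that yields consecutive groups of trials, in a similar spirit to a batched dataset that keeps its final, smaller batch.

```python
import numpy as np

def iter_batches(X, y, batch_size=32):
    """Yield consecutive (data, labels) batches along the trial axis."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# Synthetic data with the same shape as the concatenated two-subject example
X = np.zeros((1350, 272, 11), dtype=np.float32)
y = np.zeros(1350, dtype=np.int32)

batch_sizes = [len(batch_X) for batch_X, _ in iter_batches(X, y, batch_size=32)]
print(len(batch_sizes), batch_sizes[0], batch_sizes[-1])
```

With 1350 trials and a batch size of 32, this gives 42 full batches and one final batch holding the 6 remaining trials.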

In [None]:
ds = load_MEG_dataset(['001', '002'], mode='concatenate', output_format='tf', batch_size=32)

Loading subject 001
Data loaded
Subject 001 complete
--------------------------------------
Loading subject 002
Data loaded
Subject 002 complete
--------------------------------------
