### Experiment: Pre-processing

**Question**: Can a model trained on unpreprocessed EEG data reach performance similar to that of a model trained on preprocessed data?

**Hypothesis**: The model will perform worse. If performance remains comparable, however, not having to (manually) preprocess EEG data would be a substantial benefit and would open up many additional applications.

**Result**:

#### Part 1: Preparing data
To load and epoch the data with hmp.utils.read_mne_data(), the files must be in .fif format.

In [17]:
import mne
from pathlib import Path
import hsmm_mvpy as hmp
import pandas as pd
import numpy as np
import xarray as xr

In [15]:
# Set up paths and file locations
data_path = Path("/mnt/d/thesis/sat1/")
behavioral_data_path = data_path / "ExperimentData/ExperimentData"
output_path = Path("data/sat1/unpreprocessed")
output_path_data = Path("data/sat1/data_unprocessed.nc")

subj_ids = [
    subj_id.name.split("-")[1][:4] for subj_id in (data_path / "eeg4").glob("*.vhdr")
]
subj_files = [
    str(output_path / f"unprocessed_{subj_id}_epo.fif") for subj_id in subj_ids
]
behavioral_files = [
    str(behavioral_data_path / f"{subj_id}-cnv-sat3_ET.csv") for subj_id in subj_ids
]
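
The subject IDs in the cell above are parsed from the BrainVision header file names. Below is a minimal sketch of that parsing step, assuming the file names follow the `MD3-<4-character id>.vhdr` pattern used when the raw files are read; the helper name `extract_subject_id` and the example file names are hypothetical and only serve as illustration.

```python
from pathlib import Path


def extract_subject_id(vhdr_path):
    """Return the 4-character subject ID from a name like 'MD3-0001.vhdr'."""
    # Same expression as in the cell above: take the text after the first "-"
    # and keep its first four characters.
    return Path(vhdr_path).name.split("-")[1][:4]


# Hypothetical file names, for illustration only
example_files = ["MD3-0001.vhdr", "MD3-0010.vhdr", "MD3-0025.vhdr"]
example_ids = [extract_subject_id(f) for f in example_files]
print(example_ids)
```

Because `Path.glob` does not guarantee a particular order, wrapping the glob in `sorted()` may be worth considering if a reproducible participant order matters.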

In [None]:
# Replaces the preprocessing done in https://github.com/GWeindel/hsmm_mvpy/blob/main/tutorials/sample_data/eeg/0022.ipynb
# with only the necessary (non-manual) steps, such as adding the metadata that the HMP package needs; see the link above for details
for subject_id in subj_ids:
    print(f"Processing subject: {subject_id}")
    # Strip leading zeros only; replace("0", "") would also turn "0010" into "1"
    subject_id_short = subject_id.lstrip("0")
    raw = mne.io.read_raw_brainvision(
        data_path / "eeg4" / f"MD3-{subject_id}.vhdr", preload=False
    )
    raw.set_channel_types(
        {"EOGh": "eog", "EOGv": "eog", "A1": "misc", "A2": "misc"}
    )  # Declare type to avoid confusion with EEG channels
    raw.rename_channels({"FP1": "Fp1", "FP2": "Fp2"})  # Naming convention
    raw.set_montage("standard_1020")  # Standard 10-20 electrode montage
    raw.rename_channels({"Fp1": "FP1", "Fp2": "FP2"})

    behavioral_path = behavioral_data_path / f"{subject_id}-cnv-sat3_ET.csv"
    behavior = pd.read_csv(behavioral_path, sep=";")[
        [
            "stim",
            "resp",
            "RT",
            "cue",
            "movement",
        ]
    ]
    # Map numeric codes to labels; codes not listed become NaN
    behavior["movement"] = behavior["movement"].map(
        {-1: "stim_left", 1: "stim_right"}
    )
    behavior["resp"] = behavior["resp"].map({1: "resp_left", 2: "resp_right"})
    # Merge the experimental condition info into the format condition/stimulus/response
    behavior["trigger"] = (
        behavior["cue"] + "/" + behavior["movement"] + "/" + behavior["resp"]
    )
    # Set RTs below 300 ms or above 3000 ms to 0; convert the remaining RTs from ms to seconds
    behavior["RT"] = behavior.apply(
        lambda row: 0
        if row["RT"] < 300
        else (0 if row["RT"] > 3000 else float(row["RT"]) / 1000),
        axis=1,
    )
    epochs = mne.io.read_epochs_fieldtrip(
        data_path / "eeg1" / f"data{subject_id_short}.mat", info=raw.info
    )
    epochs.metadata = behavior
    epochs.save(
        output_path / f"unprocessed_{subject_id}_epo.fif", overwrite=True, verbose=False
    )  # Save the epochs in MNE .fif format
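
The metadata rules applied in the loop above (RT cleaning and trigger construction) can be checked in isolation. The sketch below restates them in plain Python, assuming RTs are recorded in milliseconds as the division by 1000 suggests; the function names `clean_rt` and `build_trigger` are hypothetical and not part of the HMP package or MNE.

```python
MOVEMENT_LABELS = {-1: "stim_left", 1: "stim_right"}
RESP_LABELS = {1: "resp_left", 2: "resp_right"}


def clean_rt(rt_ms, lower_ms=300, upper_ms=3000):
    """Return the RT in seconds, or 0.0 when it lies outside [lower_ms, upper_ms]."""
    if rt_ms < lower_ms or rt_ms > upper_ms:
        return 0.0
    return float(rt_ms) / 1000


def build_trigger(cue, movement, resp):
    """Combine cue, movement and response into 'condition/stimulus/response'."""
    movement_label = MOVEMENT_LABELS.get(movement)
    resp_label = RESP_LABELS.get(resp)
    if movement_label is None or resp_label is None:
        # Unknown codes: the notebook cell stores a missing value in this case
        return None
    return f"{cue}/{movement_label}/{resp_label}"


print(clean_rt(683), clean_rt(250), clean_rt(3500))
print(build_trigger("AC", 1, 1))
```

Note that the boundary values 300 and 3000 are kept, matching the strict comparisons in the cell above.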

In [None]:
# Run if data_unprocessed.nc does not exist
data = hmp.utils.read_mne_data(
    subj_files,
    epoched=True,
    lower_limit_RT=0.2,
    upper_limit_RT=2,
    verbose=True,
    subj_idx=subj_ids,
    rt_col="RT",
)
data.to_netcdf(output_path_data)
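
The "run only if the file does not exist" step above can be automated instead of relying on manually skipping the cell. The following is a minimal sketch of such a load-or-build helper; `load_or_build` and the JSON demo functions are hypothetical and use only the standard library so the pattern can be run on its own.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(cache_path, build, save, load):
    """Load a cached result if the file exists; otherwise build, save and return it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return load(cache_path)
    result = build()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    save(result, cache_path)
    return result


# Stand-in demo with JSON; the payload is a placeholder, not real data
calls = []


def build_demo():
    calls.append("build")  # record how often the expensive step runs
    return {"value": 1}


def save_json(obj, path):
    path.write_text(json.dumps(obj))


def load_json(path):
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache" / "demo.json"
    first = load_or_build(cache, build_demo, save_json, load_json)
    second = load_or_build(cache, build_demo, save_json, load_json)

print(first, second, calls)
```

In this notebook, `build` would wrap the `hmp.utils.read_mne_data(...)` call, `save` would call `to_netcdf` on the result, and `load` would be `xr.load_dataset`, with `output_path_data` as the cache path.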

In [18]:
data = xr.load_dataset(output_path_data)

In [19]:
hmp_data = xr.load_dataset(Path("data/sat1/data.nc"))

In [22]:
print(data)
print(hmp_data)

<xarray.Dataset>
Dimensions:      (epochs: 200, channels: 30, samples: 993, participant: 25)
Coordinates:
  * epochs       (epochs) int64 0 1 2 3 4 5 6 7 ... 193 194 195 196 197 198 199
  * channels     (channels) object 'FP1' 'FP2' 'AFz' 'F7' ... 'CPz' 'CP2' 'CP6'
  * samples      (samples) int64 0 1 2 3 4 5 6 7 ... 986 987 988 989 990 991 992
    stim         (participant, epochs) float64 nan 1.0 1.0 1.0 ... 2.0 2.0 2.0
    resp         (participant, epochs) object '' 'resp_left' ... 'resp_left'
    RT           (participant, epochs) float64 nan 0.683 1.068 ... 0.634 1.02
    cue          (participant, epochs) object '' 'SP' 'AC' ... 'SP' 'SP' 'AC'
    movement     (participant, epochs) object '' 'stim_left' ... 'stim_right'
    trigger      (participant, epochs) object '' ... 'AC/stim_right/resp_left'
  * participant  (participant) object '0001' '0002' '0003' ... '0024' '0025'
Data variables:
    data         (participant, epochs, channels, samples) float64 nan ... nan
Attributes:
 

In [None]:
# Use information from stage_data to split data