## <a id='toc1_1_'></a>[Purpose of the notebook](#toc0_)

This notebook resamples the data in participant dataframes and saves it to a new, smaller dataframe containing only the relevant trial data. The function resamples the data and returns trials cut to the -1 s to 18 s window required by the protocol. To obtain the data used here, run load_and_resample.py with saving of raw data enabled.

The function is structured as follows:

1. Take the subset of data that excludes the adaptation and transition phases, keeping only the relevant trial data (from baseline to the end of post-stimulation), and create a datetime index.
2. Map the relevant trial variables to lists ordered by trial number, which lets us quickly re-mark the trials after resampling. The resample function doesn't handle strings well, so this mapping helps rebuild the resampled dataframe with all relevant info.
3. Iterate over trials.
4. Add empty rows at -1 s and +18 s in each trial to ensure that all trials later have the same start and length. As we saw in the initial data exploration, some trials can be shorter and the -1 s timepoint may not always exist.
5. Resample the pupil size columns for the left, right and stimulated eye. The aggregation method is the mean, which ignores NaN values.
6. Cut trial to -1:18 s period.
7. Fill other columns in resampled data using the mappings created in point 2.
8. Concatenate all trials into a new resampled, cleaned up dataframe.
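
As a rough illustration of steps 4 to 6, the sketch below pads, resamples and cuts a single toy trial. The sample values, the 2 Hz target frequency and the variable names are made up for the example and are not taken from the recordings; only the column name mirrors the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy trial: irregular samples between -0.5 s and 2.0 s, one NaN
times_ms = [-500, -490, -200, 0, 300, 310, 1200, 2000]
sizes = [4.0, 4.2, np.nan, 4.4, 4.6, 4.8, 5.0, 5.2]
trial = pd.DataFrame(
    {"Stim eye - Size Mm": sizes},
    index=pd.to_timedelta(times_ms, unit="ms"),
)

# Step 4: pad with empty rows at the protocol edges (only if absent)
for edge_s in (-1, 18):
    edge = pd.Timedelta(seconds=edge_s)
    if edge not in trial.index:
        trial.loc[edge] = np.nan
trial = trial.sort_index()

# Step 5: resample to a fixed grid; the mean skips NaN inside each bin
sample_freq = 2  # Hz, deliberately low to keep the toy output short
step = pd.Timedelta(milliseconds=1000 // sample_freq)
resampled = trial.resample(step).mean()

# Step 6: cut to the -1 s to 18 s window
resampled = resampled.loc[pd.Timedelta(seconds=-1) : pd.Timedelta(seconds=18)]

print(resampled.head())
```

Because the padded row at -1 s is the earliest time stamp, the resampling grid starts exactly there, so trials padded the same way end up on the same grid. Bins that contain no measured samples come out as NaN.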

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import datetime

participant_list = [200, 201, 202, 204, 205, 206, 207, 209, 210, 211, 212, 213]

In [None]:
def resample_by_trial(data_df: pd.DataFrame, sample_freq: int = 30):
    """Function for resampling raw data.

    Args:
        data_df (pd.DataFrame): Dataframe with raw data from one participant from loading_utils.load_participant_data
        sample_freq (int, optional): Desired sampling frequency in Hz to resample data to. Defaults to 30.

    Returns:
        pd.DataFrame: DataFrame resampled to the desired frequency, keeping only the columns: pupil sizes (Stim eye, Right, Left), Trial time datetime, Trial time Sec, Trial no, Trial type, Trial phase, Block, Test, Recording id, Participant id, Eye
    """
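    # get time step in ns from sampling frequency provided - ns give greater precision
    # (cast to int so the frequency string built later reads e.g. "33333334ns", not "33333334.0ns")
    time_step = int(np.ceil((1000 / sample_freq) * 1e6))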

    # take subset of data without transition and adaptation parts (so without non-trial values)
    # .copy() makes the subset independent of data_df, which avoids SettingWithCopyWarning below
    data_subset = data_df[data_df["Trial no"].notna()].copy()

    # map trial-relevant variables to trial numbers for trial marking after resampling (can't resample strings easily but we know trial data)
    trial_list = sorted(data_subset["Trial no"].unique())
    stim_list = [
        data_subset["Trial type"][data_subset["Trial no"] == i].unique()[0]
        for i in trial_list
    ]
    block_list = [
        data_subset["Block"][data_subset["Trial no"] == i].unique()[0]
        for i in trial_list
    ]
    test_list = [
        data_subset["Test"][data_subset["Trial no"] == i].unique()[0]
        for i in trial_list
    ]
    recording_list = [
        data_subset["Recording id"][data_subset["Trial no"] == i].unique()[0]
        for i in trial_list
    ]
    eye_list = [
        data_subset["Eye"][data_subset["Trial no"] == i].unique()[0] for i in trial_list
    ]
    participant = data_subset["Participant id"].unique()[0]

    # make datetime index for resampling (pandas doesn't work without it)
    data_subset.loc[:, "Trial time datetime"] = data_subset["Trial time Sec"].apply(
        lambda x: datetime.timedelta(seconds=x)
    )
    data_subset.set_index("Trial time datetime", inplace=True)

    # resample by trial and create a new dataframe
    trials_for_new_df = []

    for i, trial_no in enumerate(trial_list):
        trial = data_subset[
            [
                "Trial time Sec",
                "Stim eye - Size Mm",
                "Right - Size Mm",
                "Left - Size Mm",
            ]
        ][data_subset["Trial no"] == trial_no].copy()
        # add empty rows at -1 s and 18 s so that every trial has the same time ticks and length
        # (then they can e.g. easily go into an array); a time stamp that already exists is
        # skipped so that real samples are not overwritten
        for edge in (datetime.timedelta(seconds=-1), datetime.timedelta(seconds=18)):
            if edge not in trial.index:
                trial.loc[edge] = [np.nan] * len(trial.columns)
        resampled_trial = trial.resample(str(time_step) + "ns").agg(
            {
                "Stim eye - Size Mm": "mean",
                "Right - Size Mm": "mean",
                "Left - Size Mm": "mean",
            }
        )
        # cut trial to the -1 s to 18 s window
        resampled_trial = resampled_trial[
            datetime.timedelta(seconds=-1) : datetime.timedelta(seconds=18)
        ]

        # remake trial time column in seconds from new index
        resampled_trial["Trial time Sec"] = resampled_trial.index.total_seconds()

        # mark trial based on mappings
        resampled_trial["Trial no"] = [trial_no] * len(resampled_trial)
        resampled_trial["Trial type"] = [stim_list[i]] * len(resampled_trial)
        resampled_trial["Block"] = [block_list[i]] * len(resampled_trial)
        resampled_trial["Test"] = [test_list[i]] * len(resampled_trial)
        resampled_trial["Recording id"] = [recording_list[i]] * len(resampled_trial)
        resampled_trial["Eye"] = [eye_list[i]] * len(resampled_trial)
        resampled_trial["Participant id"] = [participant] * len(resampled_trial)

        # mark trial phases based on protocol
        resampled_trial.loc[resampled_trial["Trial time Sec"] < 0, "Trial phase"] = (
            "pre-stim"
        )
        resampled_trial.loc[
            (resampled_trial["Trial time Sec"] >= 0)
            & (resampled_trial["Trial time Sec"] <= 5),
            "Trial phase",
        ] = "stim"
        resampled_trial.loc[resampled_trial["Trial time Sec"] > 5, "Trial phase"] = (
            "post-stim"
        )

        trials_for_new_df.append(resampled_trial)

    new_df = pd.concat(trials_for_new_df)
    new_df.reset_index(inplace=True)
    return new_df

In [16]:
data_dir = "./results/new/"

participant_id = 209
filepath = os.path.join(data_dir, str(participant_id) + "_recording_data.csv")
data_df = pd.read_csv(
    filepath,
    usecols=[
        "Trial time Sec",
        "Right - Size Mm",
        "Left - Size Mm",
        "Stim eye - Size Mm",
        "Eye",
        "Block",
        "Test",
        "Participant id",
        "Recording id",
        "Trial phase",
        "Trial no",
        "Trial type",
    ],
)

In [None]:
resampled_df = resample_by_trial(data_df, sample_freq=30)

In [22]:
resampled_df

Unnamed: 0,Trial time datetime,Stim eye - Size Mm,Right - Size Mm,Left - Size Mm,Trial time Sec,Trial no,Trial type,Block,Test,Recording id,Eye,Participant id,Trial phase
0,-1 days +23:59:59,4.53878,,4.53878,-1.000000,1.0,s,0,a,0,L,209,pre-stim
1,-1 days +23:59:59.033333334,4.52158,,4.52158,-0.966667,1.0,s,0,a,0,L,209,pre-stim
2,-1 days +23:59:59.066666668,4.53736,,4.53736,-0.933334,1.0,s,0,a,0,L,209,pre-stim
3,-1 days +23:59:59.100000002,4.54607,,4.54607,-0.900000,1.0,s,0,a,0,L,209,pre-stim
4,-1 days +23:59:59.133333336,4.56269,,4.56269,-0.866667,1.0,s,0,a,0,L,209,pre-stim
...,...,...,...,...,...,...,...,...,...,...,...,...,...
327745,0 days 00:00:17.833333710,6.73492,5.86592,6.73492,17.833333,575.0,flux,10,b,23,L,209,post-stim
327746,0 days 00:00:17.866667044,6.71676,5.90626,6.71676,17.866667,575.0,flux,10,b,23,L,209,post-stim
327747,0 days 00:00:17.900000378,6.71894,5.89317,6.71894,17.900000,575.0,flux,10,b,23,L,209,post-stim
327748,0 days 00:00:17.933333712,6.65416,5.96758,6.65416,17.933333,575.0,flux,10,b,23,L,209,post-stim


In [25]:
resampled_df

Unnamed: 0,Trial time datetime,Stim eye - Size Mm,Right - Size Mm,Left - Size Mm,Trial time Sec,Trial no,Trial type,Block,Test,Recording id,Eye,Participant id,Trial phase
0,-1 days +23:59:59,4.53878,,4.53878,-1.000000,1.0,s,0,a,0,L,209,pre-stim
1,-1 days +23:59:59.033333334,4.52158,,4.52158,-0.966667,1.0,s,0,a,0,L,209,pre-stim
2,-1 days +23:59:59.066666668,4.53736,,4.53736,-0.933334,1.0,s,0,a,0,L,209,pre-stim
3,-1 days +23:59:59.100000002,4.54607,,4.54607,-0.900000,1.0,s,0,a,0,L,209,pre-stim
4,-1 days +23:59:59.133333336,4.56269,,4.56269,-0.866667,1.0,s,0,a,0,L,209,pre-stim
...,...,...,...,...,...,...,...,...,...,...,...,...,...
327745,0 days 00:00:17.833333710,6.73492,,6.73492,17.833333,575.0,flux,10,b,23,L,209,post-stim
327746,0 days 00:00:17.866667044,6.71676,,6.71676,17.866667,575.0,flux,10,b,23,L,209,post-stim
327747,0 days 00:00:17.900000378,6.71894,,6.71894,17.900000,575.0,flux,10,b,23,L,209,post-stim
327748,0 days 00:00:17.933333712,6.65416,,6.65416,17.933333,575.0,flux,10,b,23,L,209,post-stim


The result is a dataframe with trials resampled to 30 Hz, each starting at -1 s and ending at 18 s. Because the trial start and end times are hard-coded, every trial has the same sample time stamps, which makes it possible to compute statistics such as the mean across trials. Resampling is done with all NaN values in the stimulated eye removed.
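
A minimal sketch of why the shared time stamps matter: with every trial on the same grid, the long dataframe can be pivoted into a trials-by-time array and averaged per time stamp. The `toy` frame below is a hypothetical stand-in for `resampled_df` with invented values; only the column names follow the real output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for resampled_df: 3 trials sharing the same 4 time stamps
time_grid = [-1.0, -0.5, 0.0, 0.5]
toy = pd.DataFrame(
    {
        "Trial no": np.repeat([1, 2, 3], len(time_grid)),
        "Trial time Sec": np.tile(time_grid, 3),
        "Stim eye - Size Mm": [
            4.0, 4.2, 4.4, 4.6,
            5.0, np.nan, 5.4, 5.6,
            6.0, 6.2, 6.4, 6.6,
        ],
    }
)

# sanity check: every trial has the same number of samples
samples_per_trial = toy.groupby("Trial no").size()
assert samples_per_trial.nunique() == 1

# wide layout: one row per trial, one column per time stamp
wide = toy.pivot(
    index="Trial no", columns="Trial time Sec", values="Stim eye - Size Mm"
)

# mean trace across trials; NaN values are skipped per time stamp
mean_trace = wide.mean(axis=0)
print(mean_trace)
```

The same pattern should carry over to the real dataframe as long as "Trial no" is unique per trial within the frame being pivoted.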

### <a id='toc1_1_1_'></a>[Resampling with removal](#toc0_)

Legacy function for resampling data after removal of NaN values in the stimulated eye at timestamps where the other eye is measured.
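
To make the removal rule concrete, here is a small sketch on an invented four-row frame. It expresses the same filter as a single boolean mask, which is an equivalent reformulation for illustration rather than the notebook's own code; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical samples; the stimulated eye is the left eye here
toy = pd.DataFrame(
    {
        "Stim eye - Size Mm": [4.5, np.nan, np.nan, 4.7],
        "Right - Size Mm": [np.nan, 5.1, np.nan, 5.0],
        "Left - Size Mm": [4.5, np.nan, np.nan, 4.7],
    }
)

stim_nan = toy["Stim eye - Size Mm"].isna()
both_nan = toy["Right - Size Mm"].isna() & toy["Left - Size Mm"].isna()

# rows dropped by the legacy filter: stimulated eye is NaN while the other eye has a value
dropped = stim_nan & ~both_nan
kept = toy[~dropped]
print(kept)
```

Row 1 is removed because only the non-stimulated eye was measured. Row 2, where neither eye was measured, is kept, so genuinely missing samples still show up as NaN after resampling.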

In [6]:
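def mark_not_measured(data_df):
    """Flag each sample as True (stimulated eye measured), False (stimulated eye
    not measured while the other eye was) or "missing" (neither eye measured)."""
    # object dtype so the column can hold both booleans and the "missing" label
    data_df["Stim eye - Measured"] = (
        data_df["Stim eye - Size Mm"].notna().astype(object)
    )
    data_df.loc[
        (data_df["Right - Size Mm"].isna()) & (data_df["Left - Size Mm"].isna()),
        "Stim eye - Measured",
    ] = "missing"
    return data_df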

In [7]:
data_df = mark_not_measured(data_df)
data_df = data_df[data_df["Stim eye - Measured"] != False]
resampled_df = resample_by_trial(data_df)


In [8]:
resampled_df

Unnamed: 0,Trial time datetime,Stim eye - Size Mm,Trial time Sec,Trial no,Trial type,Block,Test,Recording id,Eye,Participant id,Trial phase
0,-1 days +23:59:59,4.53878,-1.00,1.0,s,0,a,0,L,209,pre-stim
1,-1 days +23:59:59.020000,,-0.98,1.0,s,0,a,0,L,209,pre-stim
2,-1 days +23:59:59.040000,4.52158,-0.96,1.0,s,0,a,0,L,209,pre-stim
3,-1 days +23:59:59.060000,4.53736,-0.94,1.0,s,0,a,0,L,209,pre-stim
4,-1 days +23:59:59.080000,,-0.92,1.0,s,0,a,0,L,209,pre-stim
...,...,...,...,...,...,...,...,...,...,...,...
546820,0 days 00:00:17.920000,6.71894,17.92,575.0,flux,10,b,23,L,209,post-stim
546821,0 days 00:00:17.940000,,17.94,575.0,flux,10,b,23,L,209,post-stim
546822,0 days 00:00:17.960000,6.65416,17.96,575.0,flux,10,b,23,L,209,post-stim
546823,0 days 00:00:17.980000,6.62814,17.98,575.0,flux,10,b,23,L,209,post-stim
