**Table of contents**<a id='toc0_'></a>    
- [Purpose of the notebook](#toc1_1_)    
- [General guidelines based on EDA](#toc2_)    
- [Removal based on data percentage threshold](#toc3_)    
- [Removal based on length of NaN sequences](#toc4_)    
- [Condition rejection](#toc5_)    
- [Block rejection](#toc6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Purpose of the notebook](#toc0_)

This notebook develops an algorithm for accepting or rejecting trials, blocks, and participants. The step is required because reliable interpolation and analysis need data that is sufficiently complete: few missing values, no long gaps in the signal, and enough trials per condition in each block to average. The results of the functions below are demonstrated on data from participant 209. To generate the data, run the `load_and_resample.py` script.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import loading_utils as load
import preprocessing_utils as prep

participant_list = [200, 201, 202, 204, 205, 206, 207, 209, 210, 211, 212, 213]

# <a id='toc2_'></a>[General guidelines based on EDA on trial completeness](#toc0_)

<b>Trial acceptance thresholds:</b>

at least 75% non-NaN samples in the 0 to 6 s period of interest, for data resampled to 30 Hz - to ensure analysis power

at least 40% non-NaN samples in the baseline at 30 Hz - to ensure there are enough samples to average the baseline for the period of interest

no NaN sequence longer than a limit x (in samples or ms), to be determined from EDA - to avoid interpolating long stretches of data, which can affect signal quality

<b>Condition acceptance threshold:</b>

in a block: minimum 3 trials in a condition - to ensure the condition can be averaged


<b>Block acceptance threshold:</b>

minimum 3 trials in flux - it is the base condition against which the other conditions are evaluated

minimum 1 condition other than flux with at least 3 trials - we need something to evaluate against flux
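
To make the percentage thresholds concrete, the sketch below converts them into minimum sample counts. It assumes an ideal, evenly spaced 30 Hz grid that includes both window borders, which the resampled data may only approximate. The helper name `min_samples_required` is hypothetical and not part of the notebook's utilities.

```python
import math


def min_samples_required(window_s, fs, percent):
    """Minimum number of non-NaN samples needed in a window (hypothetical helper).

    Assumes an evenly spaced grid at `fs` Hz that includes both window borders.
    """
    start, end = window_s
    n_samples = round((end - start) * fs) + 1  # samples on the inclusive grid
    return math.ceil(percent / 100 * n_samples)


# period of interest: 0 to 6 s at 75%; baseline: -1 to 0 s at 40%
poi_min = min_samples_required((0, 6), fs=30, percent=75)
baseline_min = min_samples_required((-1, 0), fs=30, percent=40)
print(poi_min, baseline_min)
```

Under these assumptions, a trial needs 136 of 181 samples in the period of interest and 13 of 31 samples in the baseline.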



# <a id='toc3_'></a>[Removal based on data percentage threshold](#toc0_)

The function takes the minimum percentage of data that must be present in the baseline and in the period of interest (POI), together with the time borders of both windows (in seconds). It returns a dataframe from which trials that do not meet both minimums have been removed. Default values are 40% for the baseline and 75% for the POI.
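
The percentage of data present can be computed per trial as the ratio of the non-NaN `count` to the total `size` of each group. The toy example below illustrates this on made-up values; only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

# toy data: two trials with four samples each (values are made up)
toy_df = pd.DataFrame(
    {
        "Trial no": [1, 1, 1, 1, 2, 2, 2, 2],
        "Stim eye - Size Mm": [5.1, np.nan, 5.0, 4.9, np.nan, np.nan, np.nan, 4.8],
    }
)

# "count" ignores NaN, "size" counts every row, so their ratio is the share of data present
percent_present = toy_df.groupby("Trial no")["Stim eye - Size Mm"].agg(["count", "size"])
percent_present["percent"] = percent_present["count"] / percent_present["size"] * 100

# trials meeting a 75% threshold
accepted_trials = percent_present.index[percent_present["percent"] >= 75].tolist()
print(percent_present)
print(accepted_trials)
```

Trial 1 has 3 of 4 samples present (75%) and is accepted; trial 2 has 1 of 4 (25%) and is rejected.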

In [None]:
def remove_trials_below_percentage(
    resampled_df,
    baseline_threshold=40,
    poi_threshold=75,
    baseline_time=(-1, 0),
    poi_time=(0, 6),
):

    resampled_df = resampled_df.copy()

    def percent_present(time_window):
        # percentage of non-NaN pupil size samples per trial within the window (borders inclusive)
        window_df = resampled_df[
            (resampled_df["Trial time Sec"] >= time_window[0])
            & (resampled_df["Trial time Sec"] <= time_window[1])
        ]
        counts = window_df.groupby("Trial no")["Stim eye - Size Mm"].agg(
            ["count", "size"]
        )
        return counts["count"] / counts["size"] * 100

    # compute data percentage present in poi and in baseline, indexed by trial number
    poi_percent = percent_present(poi_time)
    baseline_percent = percent_present(baseline_time)

    # find trials matching both the poi condition and the baseline condition;
    # working on trial numbers keeps the two windows aligned even if a trial
    # has no samples in one of them (such a trial is rejected)
    trials_poi = poi_percent.index[poi_percent >= poi_threshold]
    trials_baseline = baseline_percent.index[baseline_percent >= baseline_threshold]
    trials_accepted = trials_poi.intersection(trials_baseline)

    # select only found trials from original dataframe
    removed_df = resampled_df[resampled_df["Trial no"].isin(trials_accepted)]
    removed_df = removed_df.reset_index(drop=True)

    return removed_df

In [50]:
data_dir = "./results/resampled/"  # directory with resampled data
data_suffix = "_nonan_30_resampled_data.csv"  # name of file with 30 Hz resampled data from participant 2xx, name format: 2xxdata_suffix

data_path = os.path.join(data_dir, str(209) + data_suffix)
data_df = pd.read_csv(data_path)

thresholded_df = remove_trials_below_percentage(
    data_df,
    baseline_threshold=40,
    poi_threshold=75,
    baseline_time=[-1, 0],
    poi_time=[0, 6],
)

In [51]:
no_trials_before_threshold = len(data_df["Trial no"].unique())
no_trials_after_threshold = len(thresholded_df["Trial no"].unique())
print(f"Number of trials before thresholding: {no_trials_before_threshold}")
print(f"Number of trials after thresholding: {no_trials_after_threshold}")

Number of trials before thresholding: 575
Number of trials after thresholding: 129


In [52]:
thresholded_df

Unnamed: 0.1,Unnamed: 0,Trial time datetime,Stim eye - Size Mm,Trial time Sec,Trial no,Trial type,Block,Test,Recording id,Eye,Participant id,Trial phase
0,9690,-1 days +23:59:59,5.67868,-1.000000,18.0,s,0,a,0,L,209,pre-stim
1,9691,-1 days +23:59:59.033333334,,-0.966667,18.0,s,0,a,0,L,209,pre-stim
2,9692,-1 days +23:59:59.066666668,,-0.933334,18.0,s,0,a,0,L,209,pre-stim
3,9693,-1 days +23:59:59.100000002,5.21232,-0.900000,18.0,s,0,a,0,L,209,pre-stim
4,9694,-1 days +23:59:59.133333336,5.20528,-0.866667,18.0,s,0,a,0,L,209,pre-stim
...,...,...,...,...,...,...,...,...,...,...,...,...
73525,327745,0 days 00:00:17.833333710,6.73492,17.833333,575.0,flux,10,b,23,L,209,post-stim
73526,327746,0 days 00:00:17.866667044,6.71676,17.866667,575.0,flux,10,b,23,L,209,post-stim
73527,327747,0 days 00:00:17.900000378,6.71894,17.900000,575.0,flux,10,b,23,L,209,post-stim
73528,327748,0 days 00:00:17.933333712,6.65416,17.933333,575.0,flux,10,b,23,L,209,post-stim


# <a id='toc4_'></a>[Removal based on length of NaN sequences](#toc0_)

Remove trials that contain a NaN sequence in the period of interest longer than a given limit in milliseconds. The function takes the gap length limit in ms, the sampling rate used to convert that limit into samples, and the time borders of the period of interest.
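
Two ideas are involved here: converting a gap limit from milliseconds to samples (`limit_ms / 1000 * fs`) and measuring the running length of each NaN sequence. The sketch below shows both on a toy series; the helper names `ms_to_samples` and `nan_run_counter` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd


def ms_to_samples(duration_ms, fs):
    """Convert a duration in milliseconds to samples at `fs` Hz (hypothetical helper)."""
    return duration_ms / 1000 * fs


def nan_run_counter(signal):
    """Running length of each NaN run; the counter is 0 at every non-NaN sample."""
    is_nan = signal.isnull().astype(int)
    run_id = signal.notnull().astype(int).cumsum()  # increments at each valid sample
    return is_nan.groupby(run_id).cumsum()


toy_signal = pd.Series([7, np.nan, np.nan, np.nan, 5])
counter = nan_run_counter(toy_signal)
longest_gap = int(counter.max())  # longest NaN run in samples

print(counter.tolist())
print(longest_gap > ms_to_samples(500, fs=30))
```

For this series the counter is `[0, 1, 2, 3, 0]`, so the longest gap is 3 samples. A 500 ms limit at 30 Hz corresponds to 15 samples, so this toy trial would not be rejected.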

In [53]:
def find_consecutive_nans(trial):
    # find and make a series of found nan sequences
    nan_list = (
        trial["Stim eye - Size Mm"]
        .isnull()
        .astype(int)
        .groupby(trial["Stim eye - Size Mm"].notnull().astype(int).cumsum())
        .sum()
    )
    return nan_list


def remove_trials_with_long_nans(
    thresholded_df, fs=30, max_nan_length=500, poi_time=(0, 6)
):
    # select rows in the period of interest
    data_df = thresholded_df[
        (thresholded_df["Trial time Sec"] >= poi_time[0])
        & (thresholded_df["Trial time Sec"] <= poi_time[1])
    ].copy()

    # mark NaN sequences in a counter (e.g. for sequence: 7,NaN,NaN,NaN,5 the counter is: 0,1,2,3,0)
    data_df["NaN counter"] = 0

    for trial_no in sorted(data_df["Trial no"].unique()):
        trial = data_df[data_df["Trial no"] == trial_no]
        trial_nan_counter = (
            trial["Stim eye - Size Mm"]
            .isnull()
            .astype(int)
            .groupby(trial["Stim eye - Size Mm"].notnull().astype(int).cumsum())
            .cumsum()
        )
        data_df.loc[data_df["Trial no"] == trial_no, "NaN counter"] = trial_nan_counter

    # convert the limit from milliseconds to samples (e.g. 500 ms at 30 Hz is 15 samples)
    max_nan_samples = max_nan_length / 1000 * fs

    # find trials in which a value of the counter exceeds max nan length in samples
    trials_above_max = data_df["Trial no"][
        data_df["NaN counter"] > max_nan_samples
    ].unique()

    # select trials without sequences exceeding the limit
    removed_df = thresholded_df[~thresholded_df["Trial no"].isin(trials_above_max)]
    removed_df = removed_df.reset_index(drop=True)
    return removed_df

In [54]:
low_nan_df = remove_trials_with_long_nans(thresholded_df, max_nan_length=500)

In [55]:
no_trials_before_threshold = len(thresholded_df["Trial no"].unique())
no_trials_after_threshold = len(low_nan_df["Trial no"].unique())
print(f"Number of trials before thresholding: {no_trials_before_threshold}")
print(f"Number of trials after thresholding: {no_trials_after_threshold}")

Number of trials before thresholding: 129
Number of trials after thresholding: 110


In [56]:
low_nan_df

Unnamed: 0.1,Unnamed: 0,Trial time datetime,Stim eye - Size Mm,Trial time Sec,Trial no,Trial type,Block,Test,Recording id,Eye,Participant id,Trial phase
0,9690,-1 days +23:59:59,5.67868,-1.000000,18.0,s,0,a,0,L,209,pre-stim
1,9691,-1 days +23:59:59.033333334,,-0.966667,18.0,s,0,a,0,L,209,pre-stim
2,9692,-1 days +23:59:59.066666668,,-0.933334,18.0,s,0,a,0,L,209,pre-stim
3,9693,-1 days +23:59:59.100000002,5.21232,-0.900000,18.0,s,0,a,0,L,209,pre-stim
4,9694,-1 days +23:59:59.133333336,5.20528,-0.866667,18.0,s,0,a,0,L,209,pre-stim
...,...,...,...,...,...,...,...,...,...,...,...,...
62695,327745,0 days 00:00:17.833333710,6.73492,17.833333,575.0,flux,10,b,23,L,209,post-stim
62696,327746,0 days 00:00:17.866667044,6.71676,17.866667,575.0,flux,10,b,23,L,209,post-stim
62697,327747,0 days 00:00:17.900000378,6.71894,17.900000,575.0,flux,10,b,23,L,209,post-stim
62698,327748,0 days 00:00:17.933333712,6.65416,17.933333,575.0,flux,10,b,23,L,209,post-stim


# <a id='toc5_'></a>[Condition rejection](#toc0_)

A condition is kept in a block only if it has at least 3 trials in that block. The trial minimum can be adjusted through the `trial_min` argument. The function returns a dataframe that keeps only the block-condition pairs meeting the requirement.
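
The rule can be illustrated on a toy dataframe. The sketch below is one possible way to express it with `groupby(...).transform("nunique")`; it is not the notebook's `remove_bad_conditions` implementation, and the data values are made up.

```python
import pandas as pd

# toy data: one row per sample; block 1 has three "flux" trials and two "mel" trials
toy_df = pd.DataFrame(
    {
        "Block": [1, 1, 1, 1, 1, 1],
        "Trial type": ["flux", "flux", "flux", "flux", "mel", "mel"],
        "Trial no": [1, 1, 2, 3, 4, 5],
    }
)

trial_min = 3
# number of unique trials in each row's block-condition group
trials_in_group = toy_df.groupby(["Block", "Trial type"])["Trial no"].transform(
    "nunique"
)
kept_df = toy_df[trials_in_group >= trial_min].reset_index(drop=True)
print(kept_df["Trial type"].unique().tolist())
```

Here "flux" has 3 unique trials and is kept, while "mel" has only 2 and is removed from the block.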

In [None]:
def remove_bad_conditions(data_df, trial_min=3):
    # aggregate unique trial numbers in each block-condition group
    groupby_condition_df = (
        data_df[["Block", "Trial type", "Trial no"]]
        .groupby(["Block", "Trial type"])
        .agg({"Trial no": "nunique"})
    )
    groupby_condition_df.reset_index(inplace=True)

    # find block-condition pairs where a condition has fewer than trial_min trials
    low_cond_block_pairs = [
        (block, cond)
        for (block, cond) in zip(
            groupby_condition_df["Block"], groupby_condition_df["Trial type"]
        )
        if groupby_condition_df["Trial no"][
            (groupby_condition_df["Block"] == block)
            & (groupby_condition_df["Trial type"] == cond)
        ].values
        < trial_min
    ]
    # find trial numbers corresponding to the pairs above
    low_cond_trials = [
        trial_no
        for block, cond in low_cond_block_pairs
        for trial_no in data_df["Trial no"][
            (data_df["Block"] == block) & (data_df["Trial type"] == cond)
        ]
    ]

    # remove the found trials from dataframe
    removed_df = data_df[~data_df["Trial no"].isin(low_cond_trials)]
    removed_df = removed_df.reset_index(drop=True)
    return removed_df

In [58]:
no_low_cond_df = remove_bad_conditions(low_nan_df, trial_min=3)

In [59]:
groupby_condition_df = (
    no_low_cond_df[["Block", "Trial type", "Trial no"]]
    .groupby(["Block", "Trial type"])
    .agg({"Trial no": "nunique"})
)

In [60]:
groupby_condition_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Trial no
Block,Trial type,Unnamed: 2_level_1
1,s,3
2,flux,3
2,l-m,4
2,mel,3
5,flux,4
5,l-m,8
5,lms,9
5,mel,8
5,s,5
8,l-m,5


As the groupby dataframe of unique trials per condition and block shows, only conditions with 3 or more trials remain. After block removal, we expect blocks 2, 5, and 10 to remain.

# <a id='toc6_'></a>[Block rejection](#toc0_)

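A block is kept only if it has at least 3 trials in flux and at least 3 trials in one other condition; otherwise it is rejected. The function checks only which conditions are present in each block, so it should be run after condition rejection, which ensures that every remaining condition already has the minimum number of trials.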
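
As a minimal sketch of the block rule, the toy example below keeps a block only when it contains flux and at least one other condition. It assumes condition rejection has already been applied, so only the presence of each condition matters. The data values are made up, and the approach is an illustration rather than the notebook's `remove_bad_blocks` implementation.

```python
import pandas as pd

# toy block-condition pairs remaining after condition rejection (made-up values)
toy_df = pd.DataFrame(
    {
        "Block": [1, 2, 2, 3, 8, 8],
        "Trial type": ["s", "flux", "mel", "flux", "l-m", "mel"],
    }
)

# does the block contain flux at all?
has_flux = toy_df["Trial type"].eq("flux").groupby(toy_df["Block"]).any()
# how many distinct conditions does the block contain?
n_conditions = toy_df.groupby("Block")["Trial type"].nunique()

# keep blocks with flux and at least one other condition
kept_blocks = n_conditions.index[has_flux & (n_conditions >= 2)].tolist()
print(kept_blocks)
```

Only block 2 is kept: block 1 and block 8 have no flux, and block 3 has flux as its only condition.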

In [None]:
def remove_bad_blocks(data_df):

    # aggregate unique trial numbers in each block-condition group
    groupby_condition_df = (
        data_df[["Block", "Trial type", "Trial no"]]
        .groupby(["Block", "Trial type"])
        .agg("nunique")
    )
    groupby_condition_df.reset_index(inplace=True)

    # find blocks with no flux
    blocks_no_flux = [
        block
        for block in groupby_condition_df["Block"].unique()
        if "flux"
        not in groupby_condition_df["Trial type"][
            (groupby_condition_df["Block"] == block)
        ].to_list()
    ]

    # find blocks with only one condition; combined with the flux check above, this also rejects blocks where flux is the only condition
    blocks_no_other = [
        block
        for block in groupby_condition_df["Block"].unique()
        if len(
            groupby_condition_df["Trial type"][(groupby_condition_df["Block"] == block)]
        )
        == 1
    ]

    removed_df = data_df[
        (~data_df["Block"].isin(blocks_no_flux))
        & (~data_df["Block"].isin(blocks_no_other))
    ]
    removed_df = removed_df.reset_index(drop=True)
    return removed_df

In [None]:
no_low_block_df = remove_bad_blocks(no_low_cond_df)

In [63]:
groupby_condition_df = (
    no_low_block_df[["Block", "Trial type", "Trial no"]]
    .groupby(["Block", "Trial type"])
    .agg("nunique")
)

In [64]:
groupby_condition_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Trial no
Block,Trial type,Unnamed: 2_level_1
2,flux,3
2,l-m,4
2,mel,3
5,flux,4
5,l-m,8
5,lms,9
5,mel,8
5,s,5
10,flux,3
10,lms,3


As the groupby shows, only blocks that have flux and at least one other condition remain. This matches the expectation from the previous section.