## Purpose of the notebook

This was the first notebook in the project. It works out how to load the data and the protocol, and whether to use the whole-experiment recording or the per-sequence recordings.

### Making filepaths for protocol and data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import os
import pupilprep_utilities.loading_utils as load

rootdir = r"..\..\Simulated_folder_2"  # raw string, so the backslashes are not read as escape sequences

fp_protocol, fp_recording, fp_whole_exp = load.make_filepaths(rootdir)

print(fp_protocol, fp_recording, fp_whole_exp)

..\..\Simulated_folder_2\201_20240305_right_sine_b.csv ['..\\..\\Simulated_folder_2\\Experiment_0001\\Sequence_0001\\data.csv', '..\\..\\Simulated_folder_2\\Experiment_0001\\Sequence_0002\\data.csv', '..\\..\\Simulated_folder_2\\Experiment_0001\\Sequence_0003\\data.csv'] ..\..\Simulated_folder_2\Experiment_0001\data.csv
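
The printed paths suggest a folder layout with one protocol CSV in the root, one `data.csv` per experiment, and one `data.csv` per sequence. As a minimal sketch of how such paths could be collected in a platform-independent way, assuming exactly that layout (the function name `find_recording_files` is hypothetical and this is not necessarily how `load.make_filepaths` works):

```python
from pathlib import Path


def find_recording_files(rootdir):
    """Collect protocol, per-sequence, and whole-experiment CSV paths.

    Assumed layout (inferred from the printed paths, not confirmed):
        <rootdir>/<protocol>.csv
        <rootdir>/Experiment_*/data.csv
        <rootdir>/Experiment_*/Sequence_*/data.csv
    """
    root = Path(rootdir)
    protocol = sorted(root.glob("*.csv"))
    whole = sorted(root.glob("Experiment_*/data.csv"))
    # Zero-padded folder names (Sequence_0001, ...) sort in recording order.
    sequences = sorted(root.glob("Experiment_*/Sequence_*/data.csv"))
    return protocol, sequences, whole
```

Using `pathlib` avoids hard-coding backslashes, so the same call works on Windows and on POSIX systems.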


### Making protocol dataframes

The protocol file is split into two dataframes, one with the descriptive variables and one with the timecourse, because the two parts have different numbers of columns.

In [3]:
protocol_vars_df, protocol_timecourse_df = load.make_protocol_dfs(fp_protocol)

protocol_vars_df

Unnamed: 0,Var,Val 1,Val 2
0,LR.exp,civibe_201,
1,Date,09.11.2023,
2,Author(s),Hannah S. Heinrichs,
3,Photoreceptors,CIE tooolbox,
4,Calibration,Source,20230911.0
5,Version,1,0.0
6,,,
7,Sampling time [ms],33,
8,Start delay [s],0,0.0
9,Temperature aquisition interval [tick],20,


In [4]:
protocol_timecourse_df

Unnamed: 0,NumSample,Label L,LED L1,LED L2,LED L3,LED L4,LED L5,LED L6,L L,M L,...,LED R2,LED R3,LED R4,LED R5,LED R6,L R,M R,S R,g R,Eye
0,0,dark,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0,0,0,0,R
1,7273,dark,0,0,0,0,0,0,0,0,...,0.500000,0.500000,0.500000,0.500000,0.500000,50,50,50,50,R
2,0,dark,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0,0,0,0,R
3,1,dark,0,0,0,0,0,0,0,0,...,0.500000,0.500000,0.500000,0.500000,0.500000,100,100,100,0,R
4,1,dark,0,0,0,0,0,0,0,0,...,0.515792,0.462014,0.525009,0.537161,0.475807,100,100,100,0,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3847,1,dark,0,0,0,0,0,0,0,0,...,0.451295,0.451295,0.451295,0.451295,0.451295,100,100,100,100,R
3848,1,dark,0,0,0,0,0,0,0,0,...,0.463177,0.463177,0.463177,0.463177,0.463177,100,100,100,100,R
3849,1,dark,0,0,0,0,0,0,0,0,...,0.475311,0.475311,0.475311,0.475311,0.475311,100,100,100,100,R
3850,1,dark,0,0,0,0,0,0,0,0,...,0.487613,0.487613,0.487613,0.487613,0.487613,100,100,100,100,R


### Making concatenated dataframe

I read the data file of every sequence and concatenate them into one experiment df, adding a column that specifies which trial (or adaptation) each row belongs to.

In [5]:
concat_df = load.make_concat_df(fp_recording, fp_protocol)
concat_df

Unnamed: 0,Overall time Sec,Sequence time Sec,Experiment state,Sequence index,Sequences count,Excitation index,Excitation label - Left,Excitation label - Right,Left - Is found,Left - Size Mm,...,Right - Is found,Right - Size Mm,Right - Area Mm,Right - RadiusA Px,Right - RadiusB Px,Right - PosX Px,Right - PoxY Px,Right - Distance from focus,Right - Leds temp,Eye
0,0.003,0.004,Active,1,26,2,dark,baseline,False,,...,True,9.37452,69.02204,189.94965,227.10638,351.82538,562.87006,64.18844,,R
1,0.009,0.010,Active,1,26,2,dark,baseline,False,,...,False,,,,,,,,,R
2,0.012,0.012,Active,1,26,2,dark,baseline,True,9.36967,...,False,,,,,,,,,R
3,0.028,0.029,Active,1,26,2,dark,baseline,False,,...,True,9.37800,69.07329,190.04164,227.16499,352.04379,563.15216,64.09731,,R
4,0.052,0.053,Active,1,26,2,dark,baseline,True,9.36619,...,False,,,,,,,,,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1120,280.235,19.370,Active,3,26,310,dark,baseline,True,6.45852,...,False,,,,,,,,,R
1121,280.250,19.385,Active,3,26,310,dark,baseline,False,,...,True,6.52256,33.41378,135.04973,154.63649,396.13867,478.16284,61.65505,,R
1122,280.277,19.411,Active,3,26,310,dark,baseline,True,6.50580,...,False,,,,,,,,,R
1123,280.284,19.419,Active,3,26,310,dark,baseline,False,,...,True,6.54294,33.62298,135.49684,155.09116,396.39682,478.14185,61.61373,,R
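
The concatenation step can be illustrated on toy data. This is only a sketch of the general technique (`pd.concat` plus a per-file tag), not the code inside `load.make_concat_df`; the function name `concat_sequences` and the column name `Source sequence` are hypothetical.

```python
import pandas as pd


def concat_sequences(frames):
    """Stack per-sequence frames and tag each row with its source position.

    `frames` is a list of DataFrames in recording order; position 1 is the
    first file. The added column name "Source sequence" is hypothetical.
    """
    tagged = [
        frame.assign(**{"Source sequence": position})
        for position, frame in enumerate(frames, start=1)
    ]
    # ignore_index=True gives the result one continuous row index.
    return pd.concat(tagged, ignore_index=True)


# Toy frames standing in for two Sequence_*/data.csv files.
seq_1 = pd.DataFrame({"Sequence time Sec": [0.004, 0.010]})
seq_2 = pd.DataFrame({"Sequence time Sec": [0.010, 0.027, 0.043]})
combined = concat_sequences([seq_1, seq_2])
```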


In [6]:
concat_df["Excitation label - Right"].unique()

array(['baseline', 's', 'lms'], dtype=object)

In [7]:
whole_df = load.make_whole_exp_df(fp_whole_exp, fp_protocol)
whole_df

Unnamed: 0,Overall time Sec,Sequence time Sec,Experiment state,Sequence index,Sequences count,Excitation index,Excitation label - Left,Excitation label - Right,Left - Is found,Left - Size Mm,...,Right - Size Mm,Right - Area Mm,Right - RadiusA Px,Right - RadiusB Px,Right - PosX Px,Right - PoxY Px,Right - Distance from focus,Right - Leds temp,Eye,Phase
0,0.003,0.004,Active,1,26,2.0,dark,baseline,False,,...,9.37452,69.02204,189.94965,227.10638,351.82538,562.87006,64.18844,,R,Adaptation
1,0.009,0.010,Active,1,26,2.0,dark,baseline,False,,...,,,,,,,,,R,Adaptation
2,0.012,0.012,Active,1,26,2.0,dark,baseline,True,9.36967,...,,,,,,,,,R,Adaptation
3,0.028,0.029,Active,1,26,2.0,dark,baseline,False,,...,9.37800,69.07329,190.04164,227.16499,352.04379,563.15216,64.09731,,R,Adaptation
4,0.052,0.053,Active,1,26,2.0,dark,baseline,True,9.36619,...,,,,,,,,,R,Adaptation
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42586,731.013,19.367,Active,26,26,3852.0,dark,baseline,True,6.65892,...,,,,,,,,,R,pre-stim
42587,731.029,19.383,Active,26,26,3852.0,dark,baseline,False,,...,6.65047,34.73716,137.13200,158.31989,319.60846,490.75089,102.88291,,R,pre-stim
42588,731.046,19.400,Active,26,26,3852.0,dark,baseline,True,6.62445,...,,,,,,,,,R,pre-stim
42589,731.063,19.417,Active,26,26,3852.0,dark,baseline,False,,...,6.61454,34.36286,136.34479,157.51820,318.95346,490.91568,103.39003,,R,pre-stim


In [8]:
whole_df["Excitation label - Right"].unique()

array(['baseline', nan, 's', 'lms', 'l-m', 'mel', 'flux'], dtype=object)

With the new data, the whole-experiment file seems to include all sequences, and the Sequence index column gives the trial number, so using the whole-experiment data is more straightforward. However, a fallback procedure could be implemented to recover as much data as possible, e.g. like this:

1. Make the whole-experiment dataframe.
2. Check its length, the number of sequences, or some other criterion. Example: if the number of unique values in Sequence index is < 26, return 'incomplete'.
3. If 'incomplete', make the concatenated df and apply the same check. (To save memory, the number of sequence filepaths could be checked before making the df.)
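
The procedure above could be sketched as follows. The names `is_complete` and `choose_recording` are hypothetical, and the threshold of 26 is taken from the Sequences count column of this dataset, so it is an assumption for any other recording. The loaders are passed as zero-argument callables so that the concatenated dataframe is only built when it is needed.

```python
import pandas as pd

EXPECTED_SEQUENCES = 26  # assumption: value of "Sequences count" in this dataset


def is_complete(df, expected=EXPECTED_SEQUENCES):
    """Return True if the frame contains every expected sequence index."""
    return df["Sequence index"].nunique() >= expected


def choose_recording(load_whole, load_concat, expected=EXPECTED_SEQUENCES):
    """Prefer the whole-experiment frame; fall back to the concatenated one.

    `load_whole` and `load_concat` are zero-argument callables (hypothetical
    wrappers around the real loaders).
    """
    whole = load_whole()
    if is_complete(whole, expected):
        return "whole", whole
    concat = load_concat()
    if is_complete(concat, expected):
        return "concatenated", concat
    # Neither source is complete: keep whichever covers more sequences.
    if concat["Sequence index"].nunique() > whole["Sequence index"].nunique():
        return "concatenated (incomplete)", concat
    return "whole (incomplete)", whole
```

In this notebook the callables could be `lambda: load.make_whole_exp_df(fp_whole_exp, fp_protocol)` and `lambda: load.make_concat_df(fp_recording, fp_protocol)`.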

### Analysis of differences between whole test data and concatenated data

In [None]:
print(
    "Length of adaptation phase in whole vs concatenated:",
    len(whole_df[whole_df["Sequence index"] == 1]),
    len(concat_df[concat_df["Sequence index"] == 1]),
)
print(
    "Length of trial 1 in whole vs concatenated:",
    len(whole_df[whole_df["Sequence index"] == 2]),
    len(concat_df[concat_df["Sequence index"] == 2]),
)
print(
    "Length of trial 2 in whole vs concatenated:",
    len(whole_df[whole_df["Sequence index"] == 3]),
    len(concat_df[concat_df["Sequence index"] == 3]),
)

Length of adaptation phase in whole vs concatenated: 13911 13911
Length of trial 1 in whole vs concatenated: 1181 1172
Length of trial 2 in whole vs concatenated: 1132 1125


In this dataset, the whole-experiment dataframe has a few more samples per trial than the concatenated one (9 more in trial 1, 7 more in trial 2). The adaptation phase has the same length in both.
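
The same comparison can be made for every sequence at once instead of one `print` per trial. A minimal sketch on toy data, assuming both frames have a Sequence index column (the function name `row_count_table` is hypothetical):

```python
import pandas as pd


def row_count_table(whole, concat, key="Sequence index"):
    """Count rows per sequence in both frames and report the difference."""
    counts = (
        pd.DataFrame(
            {
                "whole": whole[key].value_counts(),
                "concat": concat[key].value_counts(),
            }
        )
        .fillna(0)  # a sequence missing from one frame counts as 0 rows
        .astype(int)
        .sort_index()
    )
    counts["extra_in_whole"] = counts["whole"] - counts["concat"]
    return counts


# Toy frames: sequence 3 is missing from the concatenated data.
toy_whole = pd.DataFrame({"Sequence index": [1, 1, 2, 2, 2, 3]})
toy_concat = pd.DataFrame({"Sequence index": [1, 1, 2, 2]})
table = row_count_table(toy_whole, toy_concat)
```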

In [None]:
print(
    "Start time of trial 1 in whole vs concatenated:",
    whole_df[whole_df["Sequence index"] == 2]["Sequence time Sec"].min(),
    concat_df[concat_df["Sequence index"] == 2]["Sequence time Sec"].min(),
)
print(
    "End time of trial 1 in whole vs concatenated:",
    whole_df[whole_df["Sequence index"] == 2]["Sequence time Sec"].max(),
    concat_df[concat_df["Sequence index"] == 2]["Sequence time Sec"].max(),
)
print(
    "Start time of trial 2 in whole vs concatenated:",
    whole_df[whole_df["Sequence index"] == 3]["Sequence time Sec"].min(),
    concat_df[concat_df["Sequence index"] == 3]["Sequence time Sec"].min(),
)
print(
    "End time of trial 2 in whole vs concatenated:",
    whole_df[whole_df["Sequence index"] == 3]["Sequence time Sec"].max(),
    concat_df[concat_df["Sequence index"] == 3]["Sequence time Sec"].max(),
)

Start time of trial 1 in whole vs concatenated: 0.01 0.01
End time of trial 1 in whole vs concatenated: 19.431 19.431
Start time of trial 2 in whole vs concatenated: 0.009 0.009
End time of trial 2 in whole vs concatenated: 19.437 19.437


Despite the difference in sample counts, each trial has the same start and end values of Sequence time Sec in both dataframes.

In [11]:
# keep=False drops every row whose time stamps occur in both frames,
# so what remains are the rows found in only one of them.
duplicates_df_1 = pd.concat(
    [
        whole_df[whole_df["Sequence index"] == 2],
        concat_df[concat_df["Sequence index"] == 2],
    ]
).drop_duplicates(subset=["Sequence time Sec", "Overall time Sec"], keep=False)
duplicates_df_2 = pd.concat(
    [
        whole_df[whole_df["Sequence index"] == 3],
        concat_df[concat_df["Sequence index"] == 3],
    ]
).drop_duplicates(subset=["Sequence time Sec", "Overall time Sec"], keep=False)

In [12]:
duplicates_df_1

Unnamed: 0,Overall time Sec,Sequence time Sec,Experiment state,Sequence index,Sequences count,Excitation index,Excitation label - Left,Excitation label - Right,Left - Is found,Left - Size Mm,...,Right - Size Mm,Right - Area Mm,Right - RadiusA Px,Right - RadiusB Px,Right - PosX Px,Right - PoxY Px,Right - Distance from focus,Right - Leds temp,Eye,Phase
13911,241.127,,Passive,2,26,,,,False,,...,6.77379,36.03738,141.627,159.03297,411.24582,479.25546,58.74505,,R,stim
13912,241.135,,Passive,2,26,,,,True,6.61257,...,,,,,,,,,R,stim
13913,241.145,,Passive,2,26,,,,False,,...,6.79426,36.25548,142.15111,159.40555,411.85815,479.33939,58.66689,,R,stim
13914,241.168,,Passive,2,26,,,,True,6.60926,...,,,,,,,,,R,stim
13915,241.177,,Passive,2,26,,,,False,,...,6.81256,36.45108,142.50139,159.87158,412.14117,479.55115,58.45999,,R,stim
13916,241.204,,Passive,2,26,,,,True,6.65414,...,,,,,,,,,R,stim
13917,241.212,,Passive,2,26,,,,False,,...,6.82973,36.63503,142.75627,160.39151,412.8143,479.71127,58.31696,,R,stim
13918,241.238,,Passive,2,26,,,,True,6.66307,...,,,,,,,,,R,stim
13919,241.246,,Passive,2,26,,,,False,,...,6.83521,36.69385,142.87534,160.51515,413.11182,479.89987,58.1385,,R,stim


In [13]:
duplicates_df_2

Unnamed: 0,Overall time Sec,Sequence time Sec,Experiment state,Sequence index,Sequences count,Excitation index,Excitation label - Left,Excitation label - Right,Left - Is found,Left - Size Mm,...,Right - Size Mm,Right - Area Mm,Right - RadiusA Px,Right - RadiusB Px,Right - PosX Px,Right - PoxY Px,Right - Distance from focus,Right - Leds temp,Eye,Phase
15092,260.745,,Passive,3,26,,,,True,7.17632,...,7.17138,40.39203,149.37932,168.99944,418.64935,472.73141,65.7153,,R,stim
15093,260.773,,Passive,3,26,,,,True,7.14545,...,7.13485,39.98156,148.57889,168.1832,418.948,472.98035,65.50363,,R,stim
15094,260.775,,Passive,3,26,,,,False,,...,,,,,,,,,R,stim
15095,260.799,,Passive,3,26,,,,False,,...,7.10217,39.61613,148.0379,167.25502,419.0542,473.10718,65.39074,,R,stim
15096,260.806,,Passive,3,26,,,,True,7.10851,...,,,,,,,,,R,stim
15097,260.832,,Passive,3,26,,,,False,,...,7.06513,39.20399,147.22409,166.42992,418.61539,473.35049,65.09649,,R,stim
15098,260.838,,Passive,3,26,,,,True,7.07677,...,,,,,,,,,R,stim


The extra rows seem to come from the passive experiment state with NaN Sequence time. Pupil size measurements are still present.
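
An alternative way to isolate such rows is an anti-join with `merge(..., indicator=True)`, which also states which frame each unmatched row came from. A minimal sketch on toy data (the function name `rows_only_in_whole` and the last time stamp are made up for illustration); it assumes the time stamps in both files are read from identical text, so exact float comparison is safe.

```python
import pandas as pd

KEYS = ["Overall time Sec", "Sequence index"]


def rows_only_in_whole(whole, concat, keys=KEYS):
    """Return rows of `whole` whose key values never occur in `concat`."""
    marked = whole.merge(
        concat[keys].drop_duplicates(), on=keys, how="left", indicator=True
    )
    # "left_only" marks rows that found no partner in `concat`.
    return marked[marked["_merge"] == "left_only"].drop(columns="_merge")


# Toy frames: two Passive rows exist only in the whole-experiment data.
toy_whole = pd.DataFrame(
    {
        "Overall time Sec": [241.127, 241.135, 241.281, 241.290],
        "Sequence index": [2, 2, 2, 2],
        "Experiment state": ["Passive", "Passive", "Active", "Active"],
    }
)
toy_concat = toy_whole[toy_whole["Experiment state"] == "Active"].reset_index(drop=True)
extra = rows_only_in_whole(toy_whole, toy_concat)
```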

In [14]:
concat_df["Experiment state"].unique()

array(['Active'], dtype=object)

The concatenated df doesn't have the Passive status in Experiment state column.

In [15]:
print("Start and end times for overall experiment time")
print(
    "Start time of trial 1 in whole vs concatenated:",
    whole_df[whole_df["Sequence index"] == 2]["Overall time Sec"].min(),
    concat_df[concat_df["Sequence index"] == 2]["Overall time Sec"].min(),
)
print(
    "End time of trial 1 in whole vs concatenated:",
    whole_df[whole_df["Sequence index"] == 2]["Overall time Sec"].max(),
    concat_df[concat_df["Sequence index"] == 2]["Overall time Sec"].max(),
)
print(
    "Start time of trial 2 in whole vs concatenated:",
    whole_df[whole_df["Sequence index"] == 3]["Overall time Sec"].min(),
    concat_df[concat_df["Sequence index"] == 3]["Overall time Sec"].min(),
)
print(
    "End time of trial 2 in whole vs concatenated:",
    whole_df[whole_df["Sequence index"] == 3]["Overall time Sec"].max(),
    concat_df[concat_df["Sequence index"] == 3]["Overall time Sec"].max(),
)

Start and end times for overall experiment time
Start time of trial 1 in whole vs concatenated: 241.127 241.281
End time of trial 1 in whole vs concatenated: 260.702 260.702
Start time of trial 2 in whole vs concatenated: 260.745 260.874
End time of trial 2 in whole vs concatenated: 280.303 280.303


The row difference is located at the start of each sequence: the whole-experiment data has extra rows at the beginning.
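
The length of this leading block can be measured per sequence. A minimal sketch, assuming the Passive rows come before the first Active row of a sequence, as they do in the trials inspected here (the function name `passive_lead_in` is hypothetical; the toy time stamps are the start and end times printed above):

```python
import pandas as pd


def passive_lead_in(whole):
    """Per sequence: number of Passive rows and the time from the first
    Passive row to the first Active row."""
    rows = []
    for seq, group in whole.groupby("Sequence index"):
        passive = group[group["Experiment state"] == "Passive"]
        active = group[group["Experiment state"] == "Active"]
        if passive.empty or active.empty:
            continue  # nothing to measure for this sequence
        span = active["Overall time Sec"].min() - passive["Overall time Sec"].min()
        rows.append(
            {
                "Sequence index": seq,
                "passive_rows": len(passive),
                "lead_in_s": round(float(span), 3),
            }
        )
    return pd.DataFrame(rows)


toy = pd.DataFrame(
    {
        "Sequence index": [2, 2, 2, 3, 3, 3],
        "Experiment state": ["Passive", "Active", "Active", "Passive", "Active", "Active"],
        "Overall time Sec": [241.127, 241.281, 260.702, 260.745, 260.874, 280.303],
    }
)
lead = passive_lead_in(toy)
```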

### Summary of differences

The whole-experiment data has more samples per trial. The extra samples sit at the start of each sequence, spanning about 0.13 to 0.15 s in this dataset, and have Experiment state: Passive. They are recorded during a transition phase but still contain pupillometry data. These 'passive' samples have NaN in the Sequence time Sec column but valid values in the Overall time Sec column.

The per-sequence data files don't include the rows with Experiment state: Passive. Apart from this, the two datasets appear to be the same so far.
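
One way to test this conclusion on further recordings is to drop the Passive rows from the whole-experiment dataframe and check that the remaining time stamps match the concatenated one. A minimal sketch on toy data (the function name `matches_after_dropping_passive` is hypothetical); it compares only the key columns, because other columns can differ in dtype between the two frames (Excitation index is float in one and integer in the other).

```python
import pandas as pd


def matches_after_dropping_passive(
    whole, concat, keys=("Overall time Sec", "Sequence index")
):
    """True if `whole` without Passive rows has the same key values as `concat`."""
    keys = list(keys)
    active = whole.loc[whole["Experiment state"] != "Passive", keys]
    left = active.sort_values(keys).reset_index(drop=True)
    right = concat[keys].sort_values(keys).reset_index(drop=True)
    return left.equals(right)


# Toy frames: the whole-experiment data has one leading Passive row.
toy_whole = pd.DataFrame(
    {
        "Overall time Sec": [241.127, 241.281, 260.702],
        "Sequence index": [2, 2, 2],
        "Experiment state": ["Passive", "Active", "Active"],
    }
)
toy_concat = pd.DataFrame(
    {
        "Overall time Sec": [241.281, 260.702],
        "Sequence index": [2, 2],
        "Experiment state": ["Active", "Active"],
    }
)
same = matches_after_dropping_passive(toy_whole, toy_concat)
```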