In [1]:
# globally useful imports of standard libraries needed in this notebook.
import pandas as pd

In [2]:
# we need to look at a lot of raw table row/column information, so increase maximums
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

# Mind Wandering Raw Data Exploration and Cleaning

This notebook contains code and notes for the initial raw data analysis and exploration of the mind wandering dataset. The
dataset was collected by Sidney D'Mello's group.  The following two papers model
this data:

- Bixler, R., & D’Mello, S. (2015). Automatic gaze-based detection of mind wandering with 
  metacognitive awareness. In International conference on user modeling, adaptation,
  and personalization (Vol. 9146, pp. 31–43). Springer, Cham. Retrieved
  from http://link.springer.com/10.1007/978-3-319-20267-9_3 doi: 10.1007/978-3-319-20267-9_3
- Faber, M., Bixler, R., & D’Mello, S. K. (2018). An automated behavioral measure of mind
  wandering during computerized reading. Behavior Research Methods, 50(1), 134-150.

The dataset contains data from participants performing
a reading comprehension task.  The participants read the text one page at a time, I believe.  There is a binary categorical
variable for each page of text read (again, I believe) that indicates whether the subject experienced mind wandering
during that period / page / segment of the experiment.

Each row of the raw dataset is (I believe) one instance of a page.  There is an indicator of whether the subject's mind wandered during
that period.  It may actually be a count, since the subject may be able to record more than one instance of mind wandering on each
page.  Either way, we will be trying to create a binary classifier.
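
If the report column does turn out to be a per-page count, a binary target can be derived by thresholding it. The following is a minimal sketch on toy values, assuming the count lives in the `NumberOfReports` column that appears in the raw data; the `mind_wandered` name is illustrative only.

```python
import pandas as pd

# toy stand-in for the report count column (illustrative values, not real data)
toy = pd.DataFrame({'NumberOfReports': [0.0, 1.0, 2.0, 0.0]})

# any page with at least one report is labeled 1, otherwise 0
toy['mind_wandered'] = (toy['NumberOfReports'] > 0).astype(int)
print(toy)
```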

So multiple rows are associated with each subject in the experiment.  It is not yet clear whether all subjects have the same number of rows (pages).

The bulk of the features to be used to train prediction models are eye tracking information, like fixation durations, saccades, etc.  These are
mostly statistical summaries of the eye tracking activity during each page.

The purpose of this notebook is to record some details of the raw data for use in generating models.  We also create some standard
scikit-learn transformers for useful feature transformations we may use for training and model building.

# Initial Dataset Characteristics

The dataset is actually a tab-separated table of values, with a Windows-style carriage return
at the end of each line.  It can be read in with little preformatting using the following pandas
command:

In [3]:
df_raw = pd.read_csv('../data/mindwandering-raw-data.csv', sep='\t', lineterminator='\r')

In [4]:
df_raw.shape

(4078, 129)

There are 4078 rows and 129 columns.  The first 12 columns appear to be experimental
meta-information, identifying the participant, trial, segment and timestamps.

In [5]:
df_raw.columns[:12]

Index(['ParticipantID', 'TrialID', 'TrialIndex', 'SegmentID', 'SegmentIndex',
       'StartTime(ms)', 'EndTime(ms)', 'Length(ms)', 'StartTimestamp',
       'EndTimestamp', 'StartTimeGMT', 'EndTimeGMT'],
      dtype='object')

In [6]:
df_raw[df_raw.columns[:12]].head()

Unnamed: 0,ParticipantID,TrialID,TrialIndex,SegmentID,SegmentIndex,StartTime(ms),EndTime(ms),Length(ms),StartTimestamp,EndTimestamp,StartTimeGMT,EndTimeGMT
0,\n,,,,,,,,,,,
1,\nBE7-P1002-Memphis,MainText,1.0,,57.0,1294808.0,1322271.0,27463.0,1382120000000.0,1382120000000.0,46:38.0,47:05.5
2,\nBE7-P1002-Memphis,MainText,1.0,,56.0,1274590.0,1294808.0,20218.0,1382120000000.0,1382120000000.0,46:17.8,46:38.0
3,\nBE7-P1002-Memphis,MainText,1.0,,55.0,1251125.0,1274590.0,23465.0,1382120000000.0,1382120000000.0,45:54.3,46:17.8
4,\nBE7-P1002-Memphis,MainText,1.0,,54.0,1227726.0,1251125.0,23399.0,1382120000000.0,1382120000000.0,45:30.9,45:54.3


# Initial Data Exploration and Cleaning of Experimental Meta Information

Let's do some data exploration of what looks like the experiment meta information, and clean up the first 15 columns or so
for our use and to better understand the data.

The ParticipantID appears to be a string. It looks like it encodes some sort of participant number (P1002) and also the location where the subject
was located (Memphis).  Not sure what BE7 is.

Let's see how many unique participants there are.

In [7]:
participants = df_raw.ParticipantID.unique()
print(participants)
print(len(participants))

['\n' '\nBE7-P1002-Memphis' '\nBE7-P1003-Memphis' '\nBE7-P1003-ND'
 '\nBE7-P1004-Memphis' '\nBE7-P1004-ND' '\nBE7-P1005-Memphis'
 '\nBE7-P1005-ND' '\nBE7-P1006-Memphis' '\nBE7-P1006-ND' '\nBE7-P1007-ND'
 '\nBE7-P1008-Memphis' '\nBE7-P1009-Memphis' '\nBE7-P1009-ND'
 '\nBE7-P1010-Memphis' '\nBE7-P1010-ND' '\nBE7-P1011-ND' '\nBE7-P1012-ND'
 '\nBE7-P1013-ND' '\nBE7-P1014-ND' '\nBE7-P1015-Memphis' '\nBE7-P1015-ND'
 '\nBE7-P1016-Memphis' '\nBE7-P1016-ND' '\nBE7-P1017-Memphis'
 '\nBE7-P1017-ND' '\nBE7-P1018-Memphis' '\nBE7-P1018-ND'
 '\nBE7-P1019-Memphis' '\nBE7-P1019-ND' '\nBE7-P1020-Memphis'
 '\nBE7-P1020-ND' '\nBE7-P1021-ND' '\nBE7-P1022-Memphis' '\nBE7-P1022-ND'
 '\nBE7-P1023-Memphis' '\nBE7-P1024-Memphis' '\nBE7-P1025-Memphis'
 '\nBE7-P1025-ND' '\nBE7-P1026-Memphis' '\nBE7-P1026-ND'
 '\nBE7-P1027-Memphis' '\nBE7-P1027-ND' '\nBE7-P1028-Memphis'
 '\nBE7-P1028-ND' '\nBE7-P1029-Memphis' '\nBE7-P1029-ND'
 '\nBE7-P1030-Memphis' '\nBE7-P1030-ND' '\nBE7-P1031-Memphis'
 '\nBE7-P1031-ND' '\nBE7-P1

It looks like there are 136 unique ParticipantID values.  One entry is just '\n' with no participant id, which might come from a bad row (possibly the last one) that should be dropped.
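
The leading '\n' on every ID is probably a side effect of splitting lines on '\r' alone while the file uses '\r\n' line endings, which would leave the '\n' at the start of the next row. The following is a minimal sketch of one way to clean that up, using a toy series; the `ids` and `clean` names and the values are illustrative.

```python
import numpy as np
import pandas as pd

# toy IDs mimicking the raw column: a leading newline, plus one empty entry
ids = pd.Series(['\n', '\nBE7-P1002-Memphis', '\nBE7-P1003-ND'])

# strip surrounding whitespace, then mark empty strings as missing
clean = ids.str.strip().replace('', np.nan)
print(clean.tolist())
```

After this step the empty entries show up in `isna()` like any other missing value, and the split on '-' no longer carries a newline into the first field.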

In [8]:
# make a copy to begin manipulating data, the initial read will always be in df_raw if needed.
df = df_raw.copy()
df.iloc[4076]

ParticipantID                  \nBE7-P1104-ND
TrialID                              MainText
TrialIndex                                1.0
SegmentID                                 NaN
SegmentIndex                              1.0
StartTime(ms)                             0.0
EndTime(ms)                           29171.0
Length(ms)                            29171.0
StartTimestamp                1387130000000.0
EndTimestamp                  1387130000000.0
StartTimeGMT                          17:59.0
EndTimeGMT                            18:27.9
ValidityRate                         0.920286
PageFixations                           120.0
WindowFixations                          17.0
PageBlinks                                7.0
WindowBlinks                              1.0
BottomWindowBound_Page                12650.0
TopWindowBound_Page                   16650.0
BottomWindowBound_Session             12650.0
TopWindowBound_Session                16650.0
NumberOfReports                   

Row 4076 holds a valid record, so the empty participant id must come from some other row.  Are any TrialID values NaN?  This is the beginning of
data cleaning; we should search for missing values in other rows later.  But first let's see which rows are entirely empty and need to be dropped.

In [9]:
df[df.TrialID.isna()]

Unnamed: 0,ParticipantID,TrialID,TrialIndex,SegmentID,SegmentIndex,StartTime(ms),EndTime(ms),Length(ms),StartTimestamp,EndTimestamp,StartTimeGMT,EndTimeGMT,ValidityRate,PageFixations,WindowFixations,PageBlinks,WindowBlinks,BottomWindowBound_Page,TopWindowBound_Page,BottomWindowBound_Session,TopWindowBound_Session,NumberOfReports,FirstReportType,FirstReportContent,FirstReportTimestamp,FirstReportTimesGMT,FirstReportSessionTime(ms),FirstReportTrialTime(ms),FirstReportSegmentTime(ms),FixDurN,FixDurMed,FixDurMean,FixDurSD,FixDurMin,FixDurMax,FixDurRange,FixDurSkew,FixDurKur,FxDisp,SacDurN,SacDurMed,SacDurMean,SacDurSD,SacDurMin,SacDurMax,SacDurRange,SacDurSkew,SacDurKur,SacAmpN,SacAmpMed,SacAmpMean,SacAmpSD,SacAmpMin,SacAmpMax,SacAmpRange,SacAmpSkew,SacAmpKur,SacAngAbsN,SacAngAbsMed,SacAngAbsMean,SacAngAbsSD,SacAngAbsMin,SacAngAbsMax,SacAngAbsRange,SacAngAbsSkew,SacAngAbsKur,SacAngRelN,SacAngRelMed,SacAngRelMean,SacAngRelSD,SacAngRelMin,SacAngRelMax,SacAngRelRange,SacAngRelSkew,SacAngRelKur,SacVelN,SacVelMed,SacVelMean,SacVelSD,SacVelMin,SacVelMax,SacVelRange,SacVelSkew,SacVelKur,horizontalSaccadeProp,FxSacRatio,BlinkDurN,BlinkDurMed,BlinkDurMean,BlinkDurSD,BlinkDurMin,BlinkDurMax,BlinkDurRange,BlinkDurSkew,BlinkDurKur,PupilDiametersZN,PupilDiametersZMed,PupilDiametersZMean,PupilDiametersZSD,PupilDiametersZMin,PupilDiametersZMax,PupilDiametersZRange,PupilDiametersZSkew,PupilDiametersZKur,FirstPassFixDurMean,FirstPassFixDurSD,FirstPassFixProp,EndOfClauseFixDurMean,EndOfClauseFixDurSD,EndOfClauseFixProp,RegFixDurMean,RegFixDurSD,RegFixProp,SingleFixDurMean,SingleFixDurSD,SingleFixProp,NoWordFixDurMean,NoWordFixDurSD,NoWordFixProp,GazeFixDurMean,GazeFixDurSD,GazeFixProp,WordSkipProp,propCrossLineSaccades,readingDepth,WordLenToFixDurCorr,FreqToFixDurCorr,NumSynsToFixDurCorr,HypDepthToFixDurCorr
0,\n,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4077,\n,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


Only the first row (index 0) and the last row (index 4077) are completely missing, so drop both.

In [10]:
df = df.dropna(subset=['TrialID'])

In [11]:
print(df.shape)

(4076, 129)
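
Dropping on `TrialID` alone assumes that a row with a missing TrialID has nothing else worth keeping. The following is a minimal sketch of how that could be checked, using a toy frame with a few of the real column names; the values and the `is_empty` and `kept` names are illustrative.

```python
import numpy as np
import pandas as pd

# toy frame: rows 0 and 3 carry only a stray newline in the ID column
toy = pd.DataFrame({
    'ParticipantID': ['\n', '\nBE7-P1002-Memphis', '\nBE7-P1002-Memphis', '\n'],
    'TrialID': [np.nan, 'MainText', 'MainText', np.nan],
    'SegmentIndex': [np.nan, 57.0, 56.0, np.nan],
})

# a row counts as empty if every column other than the ID is NaN
is_empty = toy.drop(columns='ParticipantID').isna().all(axis=1)
kept = toy[~is_empty]
print(is_empty.sum(), kept.shape)
```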


It might be useful to split ParticipantID into new derived columns.  All IDs seem to use the prefix 'BE7', so whatever that
might indicate, it doesn't appear to be useful.  The middle part appears to be a simple participant id number, ranging from 1002 up to 1104.
It is probably better to just extract that number.  The last part does indeed look like the participant location, either Memphis
or ND (Notre Dame).

In [12]:
# splits on '-' character and expands into new columns in a new dataframe
participant_id_fields = df.ParticipantID.str.split('-', expand=True)
participant_id_fields.columns = ['BE7', 'participant_id', 'participant_location']
participant_id_fields.head()

Unnamed: 0,BE7,participant_id,participant_location
1,\nBE7,P1002,Memphis
2,\nBE7,P1002,Memphis
3,\nBE7,P1002,Memphis
4,\nBE7,P1002,Memphis
5,\nBE7,P1002,Memphis


In [13]:
# drop the first column, the constant 'BE7' prefix
participant_id_fields.drop(['BE7'], axis=1, inplace=True)
participant_id_fields.head()

Unnamed: 0,participant_id,participant_location
1,P1002,Memphis
2,P1002,Memphis
3,P1002,Memphis
4,P1002,Memphis
5,P1002,Memphis


In [14]:
# convert participant_id to number, might be easier to work with numeric participant id
participant_id_fields['participant_id'] = participant_id_fields.participant_id.str[1:].astype(int)
participant_id_fields.head()

Unnamed: 0,participant_id,participant_location
1,1002,Memphis
2,1002,Memphis
3,1002,Memphis
4,1002,Memphis
5,1002,Memphis
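
The split, drop and convert steps above could also be done in one pass with a regular expression. This is a sketch of an alternative, not what the notebook uses; the pattern assumes every ID follows the `BE7-P<number>-<location>` shape seen so far, and the toy `ids` values are illustrative.

```python
import pandas as pd

ids = pd.Series(['\nBE7-P1002-Memphis', '\nBE7-P1003-ND'])

# named groups become column names; the prefix and leading newline are skipped
pattern = r'BE7-P(?P<participant_id>\d+)-(?P<participant_location>\w+)'
fields = ids.str.extract(pattern)
fields['participant_id'] = fields['participant_id'].astype(int)
print(fields)
```

An ID that does not match the pattern yields NaN in both columns, so malformed IDs would surface as a failed integer conversion instead of passing through silently.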


In [15]:
# drop the old participantID, and add these new columns to our df in progress
df.drop(['ParticipantID'], axis=1, inplace=True)


In [16]:
df = participant_id_fields.join(df)
df.head()

Unnamed: 0,participant_id,participant_location,TrialID,TrialIndex,SegmentID,SegmentIndex,StartTime(ms),EndTime(ms),Length(ms),StartTimestamp,EndTimestamp,StartTimeGMT,EndTimeGMT,ValidityRate,PageFixations,WindowFixations,PageBlinks,WindowBlinks,BottomWindowBound_Page,TopWindowBound_Page,BottomWindowBound_Session,TopWindowBound_Session,NumberOfReports,FirstReportType,FirstReportContent,FirstReportTimestamp,FirstReportTimesGMT,FirstReportSessionTime(ms),FirstReportTrialTime(ms),FirstReportSegmentTime(ms),FixDurN,FixDurMed,FixDurMean,FixDurSD,FixDurMin,FixDurMax,FixDurRange,FixDurSkew,FixDurKur,FxDisp,SacDurN,SacDurMed,SacDurMean,SacDurSD,SacDurMin,SacDurMax,SacDurRange,SacDurSkew,SacDurKur,SacAmpN,SacAmpMed,SacAmpMean,SacAmpSD,SacAmpMin,SacAmpMax,SacAmpRange,SacAmpSkew,SacAmpKur,SacAngAbsN,SacAngAbsMed,SacAngAbsMean,SacAngAbsSD,SacAngAbsMin,SacAngAbsMax,SacAngAbsRange,SacAngAbsSkew,SacAngAbsKur,SacAngRelN,SacAngRelMed,SacAngRelMean,SacAngRelSD,SacAngRelMin,SacAngRelMax,SacAngRelRange,SacAngRelSkew,SacAngRelKur,SacVelN,SacVelMed,SacVelMean,SacVelSD,SacVelMin,SacVelMax,SacVelRange,SacVelSkew,SacVelKur,horizontalSaccadeProp,FxSacRatio,BlinkDurN,BlinkDurMed,BlinkDurMean,BlinkDurSD,BlinkDurMin,BlinkDurMax,BlinkDurRange,BlinkDurSkew,BlinkDurKur,PupilDiametersZN,PupilDiametersZMed,PupilDiametersZMean,PupilDiametersZSD,PupilDiametersZMin,PupilDiametersZMax,PupilDiametersZRange,PupilDiametersZSkew,PupilDiametersZKur,FirstPassFixDurMean,FirstPassFixDurSD,FirstPassFixProp,EndOfClauseFixDurMean,EndOfClauseFixDurSD,EndOfClauseFixProp,RegFixDurMean,RegFixDurSD,RegFixProp,SingleFixDurMean,SingleFixDurSD,SingleFixProp,NoWordFixDurMean,NoWordFixDurSD,NoWordFixProp,GazeFixDurMean,GazeFixDurSD,GazeFixProp,WordSkipProp,propCrossLineSaccades,readingDepth,WordLenToFixDurCorr,FreqToFixDurCorr,NumSynsToFixDurCorr,HypDepthToFixDurCorr
1,1002,Memphis,MainText,1.0,,57.0,1294808.0,1322271.0,27463.0,1382120000000.0,1382120000000.0,46:38.0,47:05.5,0.873786,100.0,11.0,7.0,0.0,12650.0,16650.0,1307458.0,1311458.0,0.0,none,none,,,,,,11.0,200.0,221.0,101.29462,83.0,366.0,283.0,0.132989,-1.389706,0.429,10.0,17.0,123.3,162.960834,16.0,450.0,434.0,1.244961,0.053705,10.0,132.924203,210.010957,210.956418,80.856303,793.888635,713.032333,2.8609,8.535334,10.0,352.773596,232.045048,168.309166,0.289026,359.406308,359.117281,-0.725668,-1.670323,9.0,353.875658,240.549308,174.528607,4.791021,359.971189,355.180168,-0.857144,-1.710707,10.0,6.684761,5.48446,3.70356,0.407012,10.050112,9.6431,-0.457247,-1.396898,1.0,1.972,0.0,,,,,,,,,38.0,-1.362156,-1.458751,0.415926,-2.365184,-0.868433,1.496751,-0.302367,-1.148603,261.0,102.0,0.55,,,0.0,,,0.0,290.0,83.0,0.45,164.0,80.0,0.55,163.0,89.0,0.45,0.636364,0.1,153.0,-0.375,-0.223,0.404,-0.56
2,1002,Memphis,MainText,1.0,,56.0,1274590.0,1294808.0,20218.0,1382120000000.0,1382120000000.0,46:17.8,46:38.0,0.839242,72.0,11.0,4.0,0.0,12650.0,16650.0,1287240.0,1291240.0,0.0,none,none,,,,,,11.0,183.0,209.090909,107.757556,133.0,499.0,366.0,2.241717,5.662505,0.436,10.0,25.5,104.8,145.311772,16.0,383.0,367.0,1.609426,1.031941,10.0,155.985073,274.13892,289.943023,66.923698,926.600337,859.676639,1.787491,2.228981,10.0,174.196626,179.839838,166.849182,2.67997,359.925463,357.245493,0.031703,-2.129207,9.0,351.088886,198.966328,183.765405,3.750852,359.611378,355.860527,-0.270329,-2.570038,10.0,4.602806,5.216583,3.52537,0.660415,11.163859,10.503445,0.681916,-0.361619,1.0,2.195,0.0,,,,,,,,,71.0,-0.20008,-0.211293,0.297847,-0.938658,0.47558,1.414238,-0.44971,-0.030409,209.0,120.0,0.82,,,0.0,133.0,,0.09,208.0,128.0,0.73,134.0,,0.09,200.0,23.0,0.18,1.384615,0.2,60.0,0.058,,-0.078,-0.657
3,1002,Memphis,MainText,1.0,,55.0,1251125.0,1274590.0,23465.0,1382120000000.0,1382120000000.0,45:54.3,46:17.8,0.762784,73.0,13.0,4.0,0.0,12650.0,16650.0,1263775.0,1267775.0,0.0,none,none,,,,,,13.0,167.0,188.538462,100.261005,83.0,416.0,333.0,1.010131,0.510088,0.554,12.0,42.0,120.5,158.185449,16.0,466.0,450.0,1.434272,0.694127,12.0,118.155517,256.603244,288.874145,66.587272,887.889747,821.302475,1.815159,2.015756,12.0,262.754541,208.224594,165.723937,0.126346,359.952012,359.825666,-0.358558,-1.953259,11.0,5.640831,98.631825,160.84248,0.174334,353.873826,353.636743,1.189756,-0.759343,12.0,5.031037,5.015407,3.913982,0.608951,12.285944,11.676993,0.515436,-0.530908,1.0,1.695,0.0,,,,,,,,,53.0,0.992546,0.974411,0.534047,-0.171852,1.834626,2.006478,-0.345951,-0.968456,200.0,107.0,0.69,,,0.0,,,0.0,192.0,111.0,0.62,183.0,93.0,0.38,183.0,93.0,0.38,1.142857,0.166667,67.0,-0.096,0.159,-0.092,-0.234
4,1002,Memphis,MainText,1.0,,54.0,1227726.0,1251125.0,23399.0,1382120000000.0,1382120000000.0,45:30.9,45:54.3,0.805556,74.0,14.0,9.0,3.0,12650.0,16650.0,1240376.0,1244376.0,0.0,none,none,,,,,,14.0,158.0,198.714286,125.994157,83.0,516.0,433.0,1.562621,2.184748,0.333,13.0,17.0,76.846154,84.129708,16.0,250.0,234.0,1.033058,-0.469978,13.0,87.00475,155.084632,201.615678,16.933348,801.90246,784.969112,3.179386,10.736262,13.0,173.778792,187.886791,165.726863,0.868051,358.535732,357.66768,-0.09278,-2.130169,12.0,30.463093,154.156733,173.179469,3.238186,358.733251,355.495065,0.381106,-2.24494,13.0,3.882799,4.405353,3.557801,0.092029,10.979019,10.88699,0.545283,-0.608007,0.923077,2.785,3.0,167.0,161.333333,9.814955,150.0,167.0,17.0,-1.732051,,78.0,-0.877777,-0.860678,0.325079,-1.563895,-0.070521,1.493373,0.559856,-0.078687,176.0,94.0,0.86,,,0.0,108.0,35.0,0.14,192.0,96.0,0.71,117.0,47.0,0.14,316.0,283.0,0.14,1.5,0.076923,82.0,-0.054,-0.575,-0.117,-0.059
5,1002,Memphis,MainText,1.0,,53.0,1207674.0,1227726.0,20052.0,1382120000000.0,1382120000000.0,45:10.9,45:30.9,0.807149,66.0,12.0,9.0,1.0,4508.0,8508.0,1212182.0,1216182.0,1.0,self-caught,other,1382120000000.0,,1219182.0,1219182.0,11508.0,12.0,191.5,190.166667,41.252732,116.0,250.0,134.0,-0.243659,-0.937062,0.502,11.0,67.0,122.636364,150.207372,16.0,399.0,383.0,1.202905,-0.322871,11.0,119.230911,268.14413,270.211194,68.773714,799.07515,730.301436,1.494452,0.925619,11.0,174.502695,210.187959,154.208257,1.987851,358.756616,356.768765,-0.371691,-1.666184,10.0,181.425399,179.690831,183.741867,0.296653,358.835734,358.53908,-0.00049,-2.567767,11.0,4.685158,4.913032,4.044616,0.661556,11.926495,11.264939,0.699785,-0.512241,1.0,1.692,1.0,233.0,233.0,,233.0,233.0,0.0,,,82.0,0.848486,0.86626,0.189015,0.332403,1.426649,1.094245,0.619191,1.268933,190.0,41.0,1.0,,,0.0,233.0,,0.08,190.0,41.0,1.0,183.0,,0.08,,,0.0,1.041667,0.181818,67.0,0.66,-0.849,-0.344,0.554


Continue cleaning.  Let's see if SegmentID is all NaN, and if so drop it.  Let's also start normalizing the feature names of the columns.


In [17]:
# it appears all of the SegmentID field is NaN, so drop it
# isna().sum() counts the missing entries (isna().count() would count every row, missing or not)
num_missing = df.SegmentID.isna().sum()

In [18]:
# drop SegmentID
df.drop(['SegmentID'], axis=1, inplace=True)

In [19]:
cols = df.columns.to_list()

In [20]:
cols[0:12] = ['participant_id', 'participant_location', 'trial_id', 'trial_index', 'segment_index',
              'start_time', 'end_time', 'trial_length',
              'start_timestamp', 'end_timestamp',
              'start_time_GMT', 'end_time_GMT']
df.columns = cols
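
Assigning names by position works, but it silently mislabels columns if the column order ever changes. A name-based mapping is a less brittle alternative. The following is a minimal sketch on an empty toy frame; `rename_map` is a hypothetical partial mapping covering only a few of the raw names.

```python
import pandas as pd

# hypothetical mapping from raw names to normalized names (partial)
rename_map = {
    'SegmentIndex': 'segment_index',
    'StartTime(ms)': 'start_time',
    'EndTime(ms)': 'end_time',
    'Length(ms)': 'trial_length',
}

toy = pd.DataFrame(columns=['SegmentIndex', 'StartTime(ms)', 'EndTime(ms)',
                            'Length(ms)', 'ValidityRate'])

# columns not present in the mapping keep their original names
toy = toy.rename(columns=rename_map)
print(toy.columns.to_list())
```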

In [21]:
df.iloc[:,:15].head()

Unnamed: 0,participant_id,participant_location,trial_id,trial_index,segment_index,start_time,end_time,trial_length,start_timestamp,end_timestamp,start_time_GMT,end_time_GMT,ValidityRate,PageFixations,WindowFixations
1,1002,Memphis,MainText,1.0,57.0,1294808.0,1322271.0,27463.0,1382120000000.0,1382120000000.0,46:38.0,47:05.5,0.873786,100.0,11.0
2,1002,Memphis,MainText,1.0,56.0,1274590.0,1294808.0,20218.0,1382120000000.0,1382120000000.0,46:17.8,46:38.0,0.839242,72.0,11.0
3,1002,Memphis,MainText,1.0,55.0,1251125.0,1274590.0,23465.0,1382120000000.0,1382120000000.0,45:54.3,46:17.8,0.762784,73.0,13.0
4,1002,Memphis,MainText,1.0,54.0,1227726.0,1251125.0,23399.0,1382120000000.0,1382120000000.0,45:30.9,45:54.3,0.805556,74.0,14.0
5,1002,Memphis,MainText,1.0,53.0,1207674.0,1227726.0,20052.0,1382120000000.0,1382120000000.0,45:10.9,45:30.9,0.807149,66.0,12.0


Continuing, let's investigate trial_id, trial_index and segment_index before looking at the times.

Neither trial_id nor trial_index has any variation, so neither carries useful information.  Drop both.

In [22]:
df.trial_id.unique()

array(['MainText'], dtype=object)

In [23]:
df.trial_index.unique()

array([1.])

In [24]:
df.drop(['trial_id', 'trial_index'], axis=1, inplace=True)
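
Columns like these, which are constant or entirely missing, can also be found in one pass instead of inspecting each by hand. The following is a minimal sketch on a toy frame; the column values and the `uninformative` name are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'trial_id': ['MainText', 'MainText', 'MainText'],
    'segment_id': [np.nan, np.nan, np.nan],
    'segment_index': [57, 56, 55],
})

# nunique() ignores NaN by default, so all-NaN columns report 0 and constants report 1
n_unique = toy.nunique()
uninformative = n_unique[n_unique <= 1].index.to_list()
print(uninformative)
```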

In [25]:
df.iloc[:,:15].head()

Unnamed: 0,participant_id,participant_location,segment_index,start_time,end_time,trial_length,start_timestamp,end_timestamp,start_time_GMT,end_time_GMT,ValidityRate,PageFixations,WindowFixations,PageBlinks,WindowBlinks
1,1002,Memphis,57.0,1294808.0,1322271.0,27463.0,1382120000000.0,1382120000000.0,46:38.0,47:05.5,0.873786,100.0,11.0,7.0,0.0
2,1002,Memphis,56.0,1274590.0,1294808.0,20218.0,1382120000000.0,1382120000000.0,46:17.8,46:38.0,0.839242,72.0,11.0,4.0,0.0
3,1002,Memphis,55.0,1251125.0,1274590.0,23465.0,1382120000000.0,1382120000000.0,45:54.3,46:17.8,0.762784,73.0,13.0,4.0,0.0
4,1002,Memphis,54.0,1227726.0,1251125.0,23399.0,1382120000000.0,1382120000000.0,45:30.9,45:54.3,0.805556,74.0,14.0,9.0,3.0
5,1002,Memphis,53.0,1207674.0,1227726.0,20052.0,1382120000000.0,1382120000000.0,45:10.9,45:30.9,0.807149,66.0,12.0,9.0,1.0


segment_index does have information.  There are 57 unique values.  They are not listed in order when the unique() function
processes this column, which I suspect means that not all participants have data for all segments.

Let's explore the segment_index more. Looking at the start and end times and other information, I believe the segment_index
is a unique value labeling each experimental segment in the series that each subject performed.  It looks like the
experiment had each subject perform 57 "segments", though I suspect that some (many) segments are missing for subjects.

Below we find out that there are 101 unique participants and 57 total segments. However, $101 \times 57 = 5757$, and we only have
$4076$ rows right now.
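
The size of the gap can be worked out directly from these counts. A small sketch, assuming one row per participant per segment would be the complete case:

```python
# counts taken from the notebook text
num_participants = 101
num_segments = 57
num_rows = 4076

# rows expected if every participant had every segment
expected_rows = num_participants * num_segments
missing_rows = expected_rows - num_rows
coverage = num_rows / expected_rows
print(expected_rows, missing_rows, round(coverage, 3))
```

Under that assumption roughly 71% of the possible participant/segment combinations are present.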

In [26]:
# the current number of rows and features
num_rows, num_features = df.shape
print(num_rows, num_features)

4076 127


In [27]:
# list the unique experiment segment indexes
df.segment_index.unique()

array([57., 56., 55., 54., 53., 52., 51., 50., 49., 48., 46., 45., 44.,
       43., 42., 41., 40., 39., 38., 37., 36., 35., 34., 33., 32., 31.,
       30., 29., 28., 26., 25., 24., 23., 22., 21., 20., 19., 18., 17.,
       16., 15., 14., 13., 12., 11., 10.,  9.,  8.,  7.,  6.,  5.,  4.,
       47., 27.,  3.,  2.,  1.])

In [28]:
# let's make segment_index an integer value as these look like whole numbers, and that will make it easier to
# transform into a categorical variable if needed
print(df.segment_index.dtype)
df.segment_index = df.segment_index.astype(int)
print(df.segment_index.dtype)

float64
int64


In [29]:
# determine the number of unique segments
num_segments = len(df.segment_index.unique())
print(num_segments)

57
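
To test the suspicion that some participants lack some segments, the segments each participant is missing can be listed with a set difference. The following is a minimal sketch on toy data with three segments instead of 57; the values and the `all_segments` and `missing` names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'participant_id': [1, 1, 1, 2, 2],
    'segment_index': [3, 2, 1, 3, 1],
})

# toy range of segments; the real data would use 1 through 57
all_segments = set(range(1, 4))

# collect the segments each participant lacks
missing = (
    toy.groupby('participant_id')['segment_index']
       .apply(lambda s: sorted(all_segments - set(s)))
)
print(missing)
```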


In [30]:
# list the unique participant ids
df.participant_id.unique()

array([1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012,
       1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023,
       1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034,
       1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045,
       1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056,
       1057, 1058, 1059, 1060, 1061, 1062, 1064, 1065, 1066, 1067, 1068,
       1069, 1070, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080,
       1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091,
       1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102,
       1103, 1104])

In [31]:
# keep track of number of participants.
num_participants = len(df.participant_id.unique())
print(num_participants)

101


Let's make a table of the number of rows (which I believe is the number of recorded segment_indexes) for each
participant.

In [32]:
df.participant_id.value_counts(sort=True)

1022    104
1037    102
1019     95
1027     94
1044     89
1003     87
1040     86
1015     84
1016     84
1004     77
1018     77
1048     74
1038     73
1031     71
1045     69
1020     59
1030     58
1083     57
1033     57
1054     56
1036     56
1049     55
1055     54
1075     53
1002     52
1058     51
1081     51
1021     51
1078     51
1008     51
1057     50
1035     50
1097     49
1006     49
1087     48
1013     48
1102     48
1005     47
1072     47
1017     47
1043     46
1046     45
1070     43
1079     42
1066     41
1101     40
1077     39
1104     39
1095     38
1034     38
1032     37
1039     37
1082     37
1025     36
1103     36
1099     35
1050     34
1009     34
1047     33
1088     33
1029     31
1069     31
1096     29
1092     29
1080     28
1061     28
1086     28
1023     27
1042     27
1059     26
1084     26
1028     25
1011     24
1052     23
1007     22
1068     22
1093     21
1098     20
1053     19
1010     18
1056     17
1065     17
1012     16
1085

Well, my previous assumption about the relationship between participant_ids and segment_indexes does not appear to be correct.  I would expect almost all
participants to have 57 rows, with none above that.  But the number of rows per participant id varies a lot, from highs of 104 to lows of 3.
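
One possible explanation is that the numeric id alone is not unique. The raw ID listing earlier shows the same number with both a Memphis and an ND suffix (P1003 and P1022, for example), so rows from two different people may be pooled under one number. The following is a minimal sketch of counting rows per (id, location) pair on toy data; the values are illustrative, and whether the pairs really are distinct people is an assumption to verify.

```python
import pandas as pd

# toy rows: participant number 1022 appears at both sites
toy = pd.DataFrame({
    'participant_id': [1022, 1022, 1022, 1002, 1002],
    'participant_location': ['Memphis', 'Memphis', 'ND', 'Memphis', 'Memphis'],
    'segment_index': [57, 56, 57, 57, 56],
})

# rows per number alone versus rows per (number, location) pair
per_number = toy.groupby('participant_id').size()
per_pair = toy.groupby(['participant_id', 'participant_location']).size()
print(per_number)
print(per_pair)
```

If the per-pair counts on the real data never exceed 57, that would support treating the (id, location) pair as the participant key.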

Let's look at the rows of participant 1022.

In [33]:
df[df['participant_id'] == 1022]

Unnamed: 0,participant_id,participant_location,segment_index,start_time,end_time,trial_length,start_timestamp,end_timestamp,start_time_GMT,end_time_GMT,ValidityRate,PageFixations,WindowFixations,PageBlinks,WindowBlinks,BottomWindowBound_Page,TopWindowBound_Page,BottomWindowBound_Session,TopWindowBound_Session,NumberOfReports,FirstReportType,FirstReportContent,FirstReportTimestamp,FirstReportTimesGMT,FirstReportSessionTime(ms),FirstReportTrialTime(ms),FirstReportSegmentTime(ms),FixDurN,FixDurMed,FixDurMean,FixDurSD,FixDurMin,FixDurMax,FixDurRange,FixDurSkew,FixDurKur,FxDisp,SacDurN,SacDurMed,SacDurMean,SacDurSD,SacDurMin,SacDurMax,SacDurRange,SacDurSkew,SacDurKur,SacAmpN,SacAmpMed,SacAmpMean,SacAmpSD,SacAmpMin,SacAmpMax,SacAmpRange,SacAmpSkew,SacAmpKur,SacAngAbsN,SacAngAbsMed,SacAngAbsMean,SacAngAbsSD,SacAngAbsMin,SacAngAbsMax,SacAngAbsRange,SacAngAbsSkew,SacAngAbsKur,SacAngRelN,SacAngRelMed,SacAngRelMean,SacAngRelSD,SacAngRelMin,SacAngRelMax,SacAngRelRange,SacAngRelSkew,SacAngRelKur,SacVelN,SacVelMed,SacVelMean,SacVelSD,SacVelMin,SacVelMax,SacVelRange,SacVelSkew,SacVelKur,horizontalSaccadeProp,FxSacRatio,BlinkDurN,BlinkDurMed,BlinkDurMean,BlinkDurSD,BlinkDurMin,BlinkDurMax,BlinkDurRange,BlinkDurSkew,BlinkDurKur,PupilDiametersZN,PupilDiametersZMed,PupilDiametersZMean,PupilDiametersZSD,PupilDiametersZMin,PupilDiametersZMax,PupilDiametersZRange,PupilDiametersZSkew,PupilDiametersZKur,FirstPassFixDurMean,FirstPassFixDurSD,FirstPassFixProp,EndOfClauseFixDurMean,EndOfClauseFixDurSD,EndOfClauseFixProp,RegFixDurMean,RegFixDurSD,RegFixProp,SingleFixDurMean,SingleFixDurSD,SingleFixProp,NoWordFixDurMean,NoWordFixDurSD,NoWordFixProp,GazeFixDurMean,GazeFixDurSD,GazeFixProp,WordSkipProp,propCrossLineSaccades,readingDepth,WordLenToFixDurCorr,FreqToFixDurCorr,NumSynsToFixDurCorr,HypDepthToFixDurCorr
1036,1022,Memphis,57,1864752.0,1888051.0,23299.0,1384210000000.0,1384210000000.0,01:04.5,01:27.7,0.795422,49.0,7.0,8.0,1.0,12650.0,16650.0,1877402.0,1881402.0,0.0,none,none,,,,,,7.0,200.0,202.285714,85.036687,84.0,317.0,233.0,0.002577,-1.413854,0.449,6.0,83.0,116.5,64.0492,67.0,233.0,166.0,1.581002,1.871358,6.0,246.544813,343.005935,268.291308,116.571592,868.065928,751.494336,2.006872,4.442103,6.0,176.546306,176.078182,155.675751,0.266343,350.582096,350.315753,-0.00692,-1.875405,5.0,350.315753,213.163926,192.859991,0.790646,356.121491,355.330845,-0.608031,-3.331982,6.0,3.091859,3.116877,1.775185,1.19036,5.787106,4.596746,0.421191,-0.981788,1.0,2.026,1.0,199.0,199.0,,199.0,199.0,0.0,,,126.0,0.891267,0.882128,0.289096,-0.02831,1.552294,1.580604,-0.199512,0.381366,202.0,85.0,1.0,,,0.0,84.0,,0.14,202.0,85.0,1.0,249.0,,0.14,,,0.0,1.428571,0.5,76.0,0.721,-0.632,-0.264,-0.749
1037,1022,Memphis,56,1829462.0,1864752.0,35290.0,1384210000000.0,1384210000000.0,00:29.1,01:04.5,0.745042,92.0,11.0,19.0,3.0,18080.0,22080.0,1847542.0,1851542.0,1.0,self-caught,other,1384210000000.0,,1854542.0,1854542.0,25080.0,11.0,167.0,198.272727,95.576243,83.0,367.0,284.0,0.464422,-0.937233,0.392,10.0,84.0,128.3,104.765399,50.0,400.0,350.0,2.298868,5.697315,10.0,161.891052,218.060744,227.784328,68.996646,848.819685,779.823038,2.850924,8.579318,10.0,251.338777,198.876354,167.840239,10.072162,358.795946,348.723784,-0.212991,-2.290375,9.0,343.710807,237.02806,163.465831,3.226965,358.462804,355.235839,-0.867783,-1.640386,10.0,1.367681,1.779481,0.980893,0.821389,3.999655,3.178266,1.429071,1.978019,0.9,1.7,3.0,166.0,221.666667,126.080662,133.0,366.0,233.0,1.599519,,80.0,-2.015311,-1.943267,0.312361,-2.453623,-1.221631,1.231992,0.549409,-0.623152,214.0,103.0,0.73,,,0.0,149.0,,0.09,230.0,105.0,0.55,208.0,58.0,0.18,150.0,95.0,0.18,1.56,0.4,62.0,-0.377,,0.277,-0.28
1038,1022,Memphis,55,1804781.0,1829462.0,24681.0,1384210000000.0,1384210000000.0,00:04.5,00:29.1,0.833221,70.0,10.0,7.0,1.0,12650.0,16650.0,1817431.0,1821431.0,0.0,none,none,,,,,,10.0,258.0,289.7,136.167421,133.0,566.0,433.0,0.921272,0.308711,0.429,9.0,33.0,55.555556,59.021418,16.0,167.0,151.0,1.567714,0.763106,9.0,130.81274,130.032207,58.107395,21.714876,237.944749,216.229873,0.01257,1.949809,9.0,350.950298,289.742762,122.627795,7.607146,359.550004,351.942859,-1.679253,1.872581,8.0,32.877217,134.78027,165.823026,2.583464,350.789089,348.205625,0.644251,-2.117548,9.0,5.029005,4.815168,3.094392,0.144766,8.74181,8.597044,-0.373252,-1.266466,0.888889,5.794,1.0,117.0,117.0,,117.0,117.0,0.0,,,92.0,0.262781,0.190566,0.255807,-0.397568,0.575999,0.973567,-0.433869,-0.768955,323.0,132.0,0.8,,,0.0,216.0,23.0,0.2,297.0,144.0,0.6,283.0,212.0,0.2,283.0,212.0,0.2,2.285714,0.0,243.0,0.199,-0.32,-0.466,-0.205
1039,1022,Memphis,53,1715868.0,1763198.0,47330.0,1384210000000.0,1384210000000.0,58:35.5,59:22.9,0.701162,125.0,6.0,31.0,1.0,10770.0,14770.0,1726638.0,1730638.0,1.0,self-caught,other,1384210000000.0,,1733638.0,1733638.0,17770.0,6.0,258.0,305.333333,158.266442,150.0,566.0,416.0,0.994045,0.004621,0.46,5.0,17.0,113.2,173.074262,16.0,416.0,400.0,2.014212,4.074272,5.0,89.032209,118.640225,72.084396,83.03934,247.491839,164.4525,2.227669,4.970026,5.0,342.243378,215.267975,187.440087,8.603203,359.171134,350.567931,-0.603737,-3.325525,4.0,186.287342,184.231939,187.922618,16.686037,347.667035,330.980998,-0.002243,-5.986456,5.0,4.884667,3.428414,2.486587,0.594932,5.564513,4.969581,-0.571861,-3.239991,1.0,3.237,1.0,399.0,399.0,,399.0,399.0,0.0,,,93.0,-2.156947,-2.127831,0.225769,-2.79603,-1.647868,1.148162,-0.172314,0.638432,305.0,158.0,1.0,,,0.0,,,0.0,305.0,158.0,1.0,283.0,,0.17,,,0.0,1.666667,0.0,171.0,-0.662,0.518,-0.028,-1.0
1040,1022,Memphis,52,1681928.0,1715868.0,33940.0,1384210000000.0,1384210000000.0,58:01.6,58:35.5,0.760432,93.0,16.0,24.0,3.0,12650.0,16650.0,1694578.0,1698578.0,0.0,none,none,,,,,,16.0,175.0,181.125,61.459879,100.0,299.0,199.0,0.584555,-0.430906,0.431,15.0,17.0,49.933333,57.757457,16.0,200.0,184.0,1.796303,2.414623,15.0,139.892976,218.961851,227.125386,61.128199,856.110462,794.982263,2.092411,4.002957,15.0,173.752093,165.514415,137.407624,2.063834,358.983045,356.919211,0.159204,-1.313686,14.0,347.173156,231.444558,167.837049,1.782085,359.638169,357.856084,-0.673688,-1.814232,15.0,5.692186,6.078172,3.338074,0.527999,12.790329,12.26233,0.469841,-0.08323,1.0,3.869,3.0,166.0,150.0,28.583212,117.0,167.0,50.0,-1.729666,,117.0,-0.579122,-0.458885,0.349213,-0.997958,0.211077,1.209035,0.34006,-1.316619,172.0,49.0,0.62,,,0.0,195.0,19.0,0.19,180.0,29.0,0.38,212.0,83.0,0.25,142.0,12.0,0.12,0.809524,0.266667,83.0,0.023,0.245,0.31,-0.234
1041,1022,Memphis,51,1651053.0,1681928.0,30875.0,1384210000000.0,1384210000000.0,57:30.7,58:01.6,0.85537,97.0,11.0,19.0,3.0,12650.0,16650.0,1663703.0,1667703.0,0.0,none,none,,,,,,11.0,233.0,228.545455,65.137337,117.0,316.0,199.0,-0.508091,-0.840115,0.37,10.0,91.5,113.3,98.370106,16.0,266.0,250.0,0.472497,-1.509195,10.0,201.577321,265.024931,251.642829,57.554322,920.854796,863.300474,2.338433,5.951375,10.0,266.934396,214.642899,162.136996,2.25962,358.903535,356.643915,-0.468099,-1.81243,9.0,9.794656,160.775605,184.455245,2.453821,357.72964,355.275818,0.270674,-2.569691,10.0,4.157725,3.658008,2.252361,0.78501,6.232149,5.447139,-0.2013,-1.808322,1.0,2.219,3.0,116.0,127.0,34.82815,99.0,166.0,67.0,1.279489,,66.0,-0.251881,-0.311247,0.305742,-0.898131,0.087903,0.986034,-0.42435,-1.29128,238.0,66.0,0.64,,,0.0,,,0.0,249.0,57.0,0.45,188.0,77.0,0.36,126.0,12.0,0.18,0.555556,0.1,130.0,-0.173,-0.266,0.725,0.681
1042,1022,Memphis,50,1619477.0,1651053.0,31576.0,1384210000000.0,1384210000000.0,56:59.1,57:30.7,0.822691,93.0,11.0,24.0,2.0,11336.0,15336.0,1630813.0,1634813.0,1.0,self-caught,other,1384210000000.0,,1637813.0,1637813.0,18336.0,11.0,233.0,240.818182,80.600023,117.0,366.0,249.0,-0.247011,-0.764435,0.493,10.0,33.5,108.2,124.982488,17.0,300.0,283.0,1.033865,-1.141795,10.0,158.60215,233.560342,252.793974,84.008909,937.439895,853.430986,2.913055,8.847516,10.0,91.708215,163.048536,176.448926,1.638839,359.819443,358.180604,0.243744,-2.295528,9.0,352.551781,199.598026,184.538437,3.187471,359.489179,356.301708,-0.270538,-2.570308,10.0,3.600536,3.832575,2.194966,0.709944,7.509004,6.79906,0.123605,-0.31227,1.0,2.448,2.0,258.0,258.0,12.727922,249.0,267.0,18.0,,,127.0,-1.311668,-1.281428,0.170083,-1.596734,-0.791267,0.805467,0.632841,0.153464,235.0,82.0,0.91,199.0,,0.09,241.0,59.0,0.18,237.0,87.0,0.82,,,0.0,,,0.0,2.75,0.1,133.0,0.428,-0.645,-0.327,-0.062
1043,1022,Memphis,49,1590132.0,1619477.0,29345.0,1384210000000.0,1384210000000.0,56:29.8,56:59.1,0.733106,75.0,9.0,22.0,1.0,12650.0,16650.0,1602782.0,1606782.0,0.0,none,none,,,,,,9.0,200.0,222.111111,56.385824,167.0,333.0,166.0,1.372892,0.785533,0.465,8.0,33.0,60.375,79.638357,16.0,250.0,234.0,2.458821,6.229953,8.0,213.598318,290.767988,253.17997,77.874094,830.309712,752.435618,1.576674,2.622751,8.0,174.202425,201.610019,146.517997,2.350686,359.278173,356.927487,-0.184977,-1.482669,7.0,343.127248,206.153216,184.28855,2.468693,359.67002,357.201328,-0.374925,-2.7823,8.0,6.34625,6.751489,3.578673,1.447451,12.935693,11.488242,0.417278,0.14162,1.0,4.139,1.0,166.0,166.0,,166.0,166.0,0.0,,,103.0,-0.182854,-0.202588,0.284138,-0.992998,0.389799,1.382797,-0.254482,0.190066,223.0,60.0,0.89,,,0.0,200.0,,0.11,224.0,65.0,0.78,183.0,,0.11,216.0,0.0,0.22,0.85,0.125,56.0,-0.239,0.456,0.548,0.997
1044,1022,Memphis,47,1543018.0,1577542.0,34524.0,1384210000000.0,1384210000000.0,55:42.7,56:17.2,0.687741,92.0,11.0,16.0,2.0,11553.0,15553.0,1554571.0,1558571.0,1.0,self-caught,other,1384210000000.0,,1561571.0,1561571.0,18553.0,11.0,233.0,240.454545,79.317544,116.0,416.0,300.0,0.967733,1.763012,0.361,10.0,33.5,85.2,128.562825,17.0,400.0,383.0,2.137442,3.975067,10.0,143.290707,223.014515,243.44384,71.99275,892.003997,820.011247,2.788184,8.191087,10.0,171.645323,179.045044,166.431573,2.0553,359.074626,357.019325,0.041564,-2.129963,9.0,350.290715,239.326059,172.800739,6.266364,359.206297,352.939933,-0.855668,-1.713469,10.0,4.210021,4.887276,2.737983,0.494017,10.259183,9.765166,0.754216,1.025192,1.0,3.104,2.0,267.0,267.0,141.421356,167.0,367.0,200.0,,,128.0,-1.633054,-1.630792,0.207416,-2.042656,-1.151333,0.891323,0.10395,-0.720505,244.0,87.0,0.82,,,0.0,182.0,94.0,0.18,245.0,100.0,0.64,183.0,,0.09,208.0,13.0,0.18,3.0,0.1,109.0,-0.561,-0.312,-0.032,0.765
1045,1022,Memphis,46,1513123.0,1543018.0,29895.0,1384210000000.0,1384210000000.0,55:12.8,55:42.7,0.706243,79.0,9.0,18.0,3.0,10771.0,14771.0,1523894.0,1527894.0,1.0,self-caught,task-related,1384210000000.0,,1530894.0,1530894.0,17771.0,9.0,150.0,194.222222,63.410917,133.0,300.0,167.0,0.542938,-1.494797,0.414,8.0,100.5,149.875,138.59338,16.0,383.0,367.0,0.706003,-1.080841,8.0,125.812138,248.929432,274.265022,103.350689,910.961931,807.611242,2.576496,6.828063,8.0,94.393915,159.5964,172.111285,4.012638,357.689437,353.676799,0.326944,-2.342899,7.0,16.093445,155.189029,182.322772,1.57992,357.29619,355.71627,0.37365,-2.790656,8.0,3.29752,2.981529,2.392137,0.365197,7.691076,7.325879,0.943358,1.208746,1.0,1.458,3.0,216.0,232.666667,28.867513,216.0,266.0,50.0,1.732051,,16.0,-0.389583,-0.375067,0.12398,-0.562658,-0.171325,0.391332,0.195936,-1.131752,194.0,63.0,1.0,,,0.0,300.0,,0.11,194.0,63.0,1.0,150.0,,0.11,,,0.0,1.416667,0.25,105.0,-0.5,0.18,-0.192,-0.124


Ok, our first mistake was assuming that participants had unique id numbers.  This is not the case: at least some participant id
numbers are duplicated across locations.  Since participant 1022 was actually 2 participants, the total number of segments could be up to 114, again
assuming the max is 57 segment indexes.

Let's go back and create a participant_id as a string (which is what they had before).  I will use 1022-UM and 1022-ND for the University
of Memphis and the University of Notre Dame respectively.  I will also keep a separate participant location column, as it could be useful
for comparing subject differences across locations.

We will redo the participant_id and participant_location from scratch, then redo all of the other cleaning steps we had done up to this point
next.
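
Before rebuilding the ids, it can help to list exactly which numeric ids are shared between locations.  The following is a minimal sketch on toy data.  The `BE7-P1022-Memphis` string layout is an assumption inferred from the way the id is split in the cells that follow, and the column names `study`, `number` and `location` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the raw ParticipantID column (string layout is assumed)
raw = pd.DataFrame({
    'ParticipantID': ['BE7-P1022-Memphis', 'BE7-P1022-ND', 'BE7-P1002-Memphis'],
})

# Split the id into its three parts
parts = raw.ParticipantID.str.split('-', expand=True)
parts.columns = ['study', 'number', 'location']

# Count how many distinct locations each numeric id appears in
sites_per_number = parts.groupby('number')['location'].nunique()

# Numeric ids used at more than one location are not unique on their own
shared_numbers = sites_per_number[sites_per_number > 1].index.tolist()
print(shared_numbers)
```

Any id that shows up in `shared_numbers` needs the location appended to become unique.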

In [34]:
# make copy of raw data frame again, so we can create participant id in correct way
df = df_raw.copy()

In [35]:
# drop all rows where TrialID is null; the last row read in from the raw data appeared to be bad/empty data, so remove it
df = df.dropna(subset=['TrialID'])

In [36]:
# splits on '-' character and expands into new columns in a new dataframe
participant_id_fields = df.ParticipantID.str.split('-', expand=True)
participant_id_fields.columns = ['BE7', 'participant_id', 'participant_location']
participant_id_fields.drop(['BE7'], axis=1, inplace=True)

# map all Memphis locations to UM to regularize categorical variable and resulting participant ids
participant_id_fields['participant_location'] = participant_id_fields.participant_location.map({'Memphis': 'UM', 'ND': 'ND'})

# use participant_location and participant_id to map a new string participant_id for use in this column
def create_custom_participant_id(row):
    """ We remove the initial P and combine participant id and location to create a
    unique id.
    """
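    # index by column label; positional access like row[0] on a labeled row is deprecated in newer pandas
    return row['participant_id'][1:] + '-' + row['participant_location']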

participant_id_fields['participant_id'] = participant_id_fields.apply(create_custom_participant_id, axis=1)
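
The row-wise `apply` in this cell can also be written with vectorized string operations, which avoids indexing into each row.  This is only a sketch on made-up rows, not a change to the cleaning steps; the three example ids are illustrative.

```python
import pandas as pd

# Toy version of participant_id_fields before the ids are combined
fields = pd.DataFrame({
    'participant_id': ['P1022', 'P1022', 'P1002'],
    'participant_location': ['Memphis', 'ND', 'Memphis'],
})

# Regularize the location names, as done in the notebook
location = fields.participant_location.map({'Memphis': 'UM', 'ND': 'ND'})

# Strip the leading 'P' and append the location code
combined = fields.participant_id.str[1:] + '-' + location
print(combined.tolist())
```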

In [37]:
# how many unique participant_ids do we have now that we keep participant locations separated
num_participants = len(participant_id_fields.participant_id.unique())
print(num_participants)

135


In [38]:
# update the working data frame now with this cleaned id and location.
# also perform steps to clean up the column names for the experiment metadata,
# and drop columns that contain no useful information.

# drop the old participantID, and add these new columns to our df in progress
df.drop(['ParticipantID'], axis=1, inplace=True)
df = participant_id_fields.join(df)

# drop SegmentId
df.drop(['SegmentID'], axis=1, inplace=True)

# regularize column names to underscore notation for the experimental metadata for the time being
cols = df.columns.to_list()
cols[0:12] = ['participant_id', 'participant_location', 'trial_id', 'trial_index', 'segment_index', \
              'start_time', 'end_time', 'trial_length', \
              'start_timestamp', 'end_timestamp', \
              'start_time_GMT', 'end_time_GMT']
df.columns = cols

# also trial_id and trial_index don't appear to be useful information
df.drop(['trial_id', 'trial_index'], axis=1, inplace=True)

# make segment index an integer as they seem to be whole numbers, and will be easier to
# make categorical if needed
df.segment_index = df.segment_index.astype(int)

# the current number of rows and features
num_rows, num_features = df.shape
num_segments = len(df.segment_index.unique())
print("Current number of rows: ", num_rows)
print("Current number of feature columns: ", num_features)
print("Current number of unique participants: ", num_participants)
print("Current number of experiment segment indexes: ", num_segments)

Current number of rows:  4076
Current number of feature columns:  127
Current number of unique participants:  135
Current number of experiment segment indexes:  57


Ok, we now seem to have 135 unique participants.  Let's try again to see the number of segments per participant, and
see what the pattern of segment numbers looks like for the participants we have.

In [39]:
df.participant_id.value_counts(sort=True)

1083-ND    57
1054-ND    56
1018-UM    56
1037-ND    55
1015-ND    55
1044-ND    55
1027-ND    55
1049-ND    55
1055-ND    54
1019-UM    54
1022-UM    53
1075-ND    53
1004-ND    52
1002-UM    52
1008-UM    51
1021-ND    51
1048-ND    51
1022-ND    51
1058-ND    51
1078-ND    51
1081-ND    51
1003-ND    50
1057-ND    50
1040-ND    49
1097-ND    49
1102-ND    48
1016-UM    48
1013-ND    48
1087-ND    48
1037-UM    47
1020-UM    47
1072-ND    47
1046-ND    45
1030-ND    44
1038-UM    43
1070-ND    43
1079-ND    42
1045-UM    41
1019-ND    41
1066-ND    41
1101-ND    40
1104-ND    39
1027-UM    39
1077-ND    39
1033-ND    38
1095-ND    38
1040-UM    37
1082-ND    37
1003-UM    37
1031-UM    37
1016-ND    36
1103-ND    36
1099-ND    35
1044-UM    34
1050-ND    34
1031-ND    34
1006-ND    34
1047-UM    33
1088-ND    33
1039-ND    33
1032-ND    33
1017-ND    32
1005-ND    32
1036-ND    32
1069-ND    31
1038-ND    30
1025-ND    30
1092-ND    29
1015-UM    29
1096-ND    29
1043-UM    29
1035-U

Ok the expected number of rows/segments per participant looks good here.  But only 1 participant has all 57 segments, and it 
ranges all the way down to a participant with only 1 row/segment.

This is bad news, I think, for some of the ideas I had about combining segments into time sequences.  First of all, the sequences
will all be of different lengths.  But maybe more troublesome: if, as I assume at the moment, the segments are labeled sequentially,
then there are going to be a lot of gaps between segments.  That means they don't represent a true time series; there are
holes where segments are missing in the sequences that we were thinking of feeding to a recurrent
neural network.

We should still get a list of the segment indexes for the participants, to get a feel for how the sequences proceed, the gaps, etc.
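
One way to quantify the gaps is to list the missing segment indexes per participant and count how many segments are directly adjacent.  The sketch below uses the four segment indexes of participant 1010-ND shown in this notebook.  The assumption that indexes run from 1 to 57 is mine, and the helper names are hypothetical.

```python
import pandas as pd

# Segment indexes for participant 1010-ND, as displayed in this notebook
toy = pd.DataFrame({
    'participant_id': ['1010-ND'] * 4,
    'segment_index': [33, 32, 26, 12],
})

def missing_segments(indexes, first_index=1, last_index=57):
    """Return the indexes in [first_index, last_index] that are absent.

    The 1..57 range is an assumption about how segments were numbered.
    """
    present = set(int(i) for i in indexes)
    return [i for i in range(first_index, last_index + 1) if i not in present]

def consecutive_pairs(indexes):
    """Count pairs of segments whose indexes are adjacent (i, i + 1)."""
    ordered = sorted(int(i) for i in indexes)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if b - a == 1)

gaps = {pid: missing_segments(group.segment_index)
        for pid, group in toy.groupby('participant_id')}
pairs = {pid: consecutive_pairs(group.segment_index)
         for pid, group in toy.groupby('participant_id')}

print(len(gaps['1010-ND']), pairs['1010-ND'])
```

For this participant only segments 32 and 33 are adjacent, so a sequence model would see at most one true transition.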

In [40]:
df[df.participant_id == '1010-ND'].iloc[:,:15]

Unnamed: 0,participant_id,participant_location,segment_index,start_time,end_time,trial_length,start_timestamp,end_timestamp,start_time_GMT,end_time_GMT,ValidityRate,PageFixations,WindowFixations,PageBlinks,WindowBlinks
434,1010-ND,ND,33,828163.0,856109.0,27946.0,1381940000000.0,1381940000000.0,30:32.2,31:00.2,0.227259,13.0,6.0,0.0,0.0
435,1010-ND,ND,32,808015.0,828163.0,20148.0,1381940000000.0,1381940000000.0,30:12.1,30:32.2,0.575921,48.0,7.0,0.0,0.0
436,1010-ND,ND,26,680764.0,705336.0,24572.0,1381940000000.0,1381940000000.0,28:04.8,28:29.4,0.720488,63.0,13.0,1.0,0.0
437,1010-ND,ND,12,311669.0,332783.0,21114.0,1381940000000.0,1381940000000.0,21:55.7,22:16.8,0.305567,20.0,6.0,0.0,0.0


Notice the timestamps.  You can see, for example, that the end_time of segment 32 matches the start_time of segment 33, while segments 12, 26 and 32 are separated by gaps in time.

We could maybe confirm with original data creation team, but I am assuming that dropped segments were removed because they
all represented bad data or bad segments of the participants in the trial.

Let's finish off the experimental metadata.  There appears to be a lot of redundant information available in the time stamps.
The start_time, end_time and trial_length may be sufficient for our current training purposes.  I know that some attempt had been
made to convert the timestamps to actual dates, which could be useful.  I'm not sure what the GMT time columns
represent.  Of course GMT implies Greenwich Mean Time, but these do not look like full 24 hour times.  They appear
to be minutes and seconds with tenths (mm:ss.t), and the elapsed time seems to mostly match the trial length in ms.
For example, for segment index 33, trial length was 27.9 seconds.  Adding 28.0 seconds would take 30:32.2 to 31:00.2.
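
The arithmetic in that example can be checked directly.  This is a small sketch that assumes the GMT columns really are `minutes:seconds.tenths` strings; `mmss_to_seconds` is a hypothetical helper, and the values are those of segment 33 for participant 1010-ND.

```python
def mmss_to_seconds(text):
    """Convert a 'mm:ss.t' string to seconds (format is an assumption)."""
    minutes, seconds = text.split(':')
    return int(minutes) * 60 + float(seconds)

# Values for segment index 33 of participant 1010-ND
start_s = mmss_to_seconds('30:32.2')
end_s = mmss_to_seconds('31:00.2')
trial_length_ms = 27946.0

elapsed_s = end_s - start_s
print(round(elapsed_s, 1), trial_length_ms / 1000)
```

The two numbers agree to within the tenth-of-a-second resolution of the GMT strings.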

My student had tried converting the timestamp as well before.  Let me see if I can create a time and date from these, assuming something
like a unix timestamp of seconds from the unix epoch here.

In [41]:
# First make an integer.  As we had seen before, seems like this timestamp is too large, as if multiplied by 1000, or
# I'm guessing, when converted to float, some extra 000 got added to digits of output.
#df.start_timestamp.dtype
df.start_timestamp = df.start_timestamp / 1000
df.start_timestamp = df.start_timestamp.astype(int)
df.start_timestamp.dtype

dtype('int64')

In [42]:
df.start_timestamp

1       1382120000
2       1382120000
3       1382120000
4       1382120000
5       1382120000
           ...    
4072    1387130000
4073    1387130000
4074    1387130000
4075    1387130000
4076    1387130000
Name: start_timestamp, Length: 4076, dtype: int64

In [43]:
pd.to_datetime(df['start_timestamp'],unit='s')

1      2013-10-18 18:13:20
2      2013-10-18 18:13:20
3      2013-10-18 18:13:20
4      2013-10-18 18:13:20
5      2013-10-18 18:13:20
               ...        
4072   2013-12-15 17:53:20
4073   2013-12-15 17:53:20
4074   2013-12-15 17:53:20
4075   2013-12-15 17:53:20
4076   2013-12-15 17:53:20
Name: start_timestamp, Length: 4076, dtype: datetime64[ns]

Looking at the raw data, those timestamp fields only have 5 digits, so they are not recoverable by themselves.
But I wonder if the start_time and the start_timestamp need to be combined to get full bits for a good date/time?

In [44]:
# add together start_time and start_timestamp, undoing the division by 1000 first here
timestamps = df.start_timestamp * 1000 + df.start_time
timestamps = timestamps.astype(int)

In [45]:
timestamps

1       1382121294808
2       1382121274590
3       1382121251125
4       1382121227726
5       1382121207674
            ...      
4072    1387130109703
4073    1387130083564
4074    1387130056943
4075    1387130029171
4076    1387130000000
Length: 4076, dtype: int64

In [46]:
pd.to_datetime(timestamps, unit='ms')

1      2013-10-18 18:34:54.808
2      2013-10-18 18:34:34.590
3      2013-10-18 18:34:11.125
4      2013-10-18 18:33:47.726
5      2013-10-18 18:33:27.674
                 ...          
4072   2013-12-15 17:55:09.703
4073   2013-12-15 17:54:43.564
4074   2013-12-15 17:54:16.943
4075   2013-12-15 17:53:49.171
4076   2013-12-15 17:53:20.000
Length: 4076, dtype: datetime64[ns]

Ok, I believe that is the correct manipulation here.  We can create a valid full date and time stamp by
combining start_time and start_timestamp, and end_time and end_timestamp.
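
As a single worked case, here is the combination for the first row of the data frame, using the values displayed earlier.  One caveat, stated as an assumption: the timestamp ends in many zeros, so it may have been rounded when the raw file was exported.  If so, the absolute clock time could be off, while differences between segments of the same participant would be unaffected.

```python
import pandas as pd

# Values for the first row, as displayed in this notebook
start_timestamp_ms = 1382120000000   # coarse epoch value in milliseconds
start_time_ms = 1294808              # offset in milliseconds

# Add the offset to the epoch value and interpret the sum as milliseconds
combined = pd.to_datetime(start_timestamp_ms + start_time_ms, unit='ms')
print(combined)
```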

Let's do that and drop the columns they are derived from.  And let's drop the GMT columns now, as I'm not sure we really
need them.

In [47]:
# undo the above manipulation to start_timestamp, and make all the columns integers before we combine
df.start_timestamp = df.start_timestamp * 1000
df.start_timestamp = df.start_timestamp.astype(int)

df.start_time = df.start_time.astype(int)

df.end_timestamp = df.end_timestamp.astype(int)

df.end_time = df.end_time.astype(int)

In [48]:
df[['start_time', 'end_time', 'start_timestamp', 'end_time', 'end_timestamp']]

Unnamed: 0,start_time,end_time,start_timestamp,end_time.1,end_timestamp
1,1294808,1322271,1382120000000,1322271,1382120000000
2,1274590,1294808,1382120000000,1294808,1382120000000
3,1251125,1274590,1382120000000,1274590,1382120000000
4,1227726,1251125,1382120000000,1251125,1382120000000
5,1207674,1227726,1382120000000,1227726,1382120000000
...,...,...,...,...,...
4072,109703,136766,1387130000000,136766,1387130000000
4073,83564,109703,1387130000000,109703,1387130000000
4074,56943,83564,1387130000000,83564,1387130000000
4075,29171,56943,1387130000000,56943,1387130000000


In [49]:
# convert to full local datetime information, replace start_time column and end_time column
df.start_time = pd.to_datetime(df.start_timestamp + df.start_time, unit='ms')
df.end_time = pd.to_datetime(df.end_timestamp + df.end_time, unit='ms')

In [50]:
# drop the timestamp and GMT columns now.
df.drop(['start_timestamp', 'end_timestamp', 'start_time_GMT', 'end_time_GMT'], axis=1, inplace=True)

In [51]:
# one final fix, trial_length is actually an integer, number of ms of the trial, so lets make it an integer
df.trial_length = df.trial_length.astype(int)

In [52]:
df.iloc[:,:15].head(15)

Unnamed: 0,participant_id,participant_location,segment_index,start_time,end_time,trial_length,ValidityRate,PageFixations,WindowFixations,PageBlinks,WindowBlinks,BottomWindowBound_Page,TopWindowBound_Page,BottomWindowBound_Session,TopWindowBound_Session
1,1002-UM,UM,57,2013-10-18 18:34:54.808,2013-10-18 18:35:22.271,27463,0.873786,100.0,11.0,7.0,0.0,12650.0,16650.0,1307458.0,1311458.0
2,1002-UM,UM,56,2013-10-18 18:34:34.590,2013-10-18 18:34:54.808,20218,0.839242,72.0,11.0,4.0,0.0,12650.0,16650.0,1287240.0,1291240.0
3,1002-UM,UM,55,2013-10-18 18:34:11.125,2013-10-18 18:34:34.590,23465,0.762784,73.0,13.0,4.0,0.0,12650.0,16650.0,1263775.0,1267775.0
4,1002-UM,UM,54,2013-10-18 18:33:47.726,2013-10-18 18:34:11.125,23399,0.805556,74.0,14.0,9.0,3.0,12650.0,16650.0,1240376.0,1244376.0
5,1002-UM,UM,53,2013-10-18 18:33:27.674,2013-10-18 18:33:47.726,20052,0.807149,66.0,12.0,9.0,1.0,4508.0,8508.0,1212182.0,1216182.0
6,1002-UM,UM,52,2013-10-18 18:33:07.056,2013-10-18 18:33:27.674,20618,0.828618,67.0,12.0,6.0,0.0,12650.0,16650.0,1199706.0,1203706.0
7,1002-UM,UM,51,2013-10-18 18:32:46.688,2013-10-18 18:33:07.056,20368,0.837152,74.0,14.0,7.0,3.0,12650.0,16650.0,1179338.0,1183338.0
8,1002-UM,UM,50,2013-10-18 18:32:27.702,2013-10-18 18:32:46.688,18986,0.834065,63.0,14.0,6.0,2.0,12650.0,16650.0,1160352.0,1164352.0
9,1002-UM,UM,49,2013-10-18 18:32:07.368,2013-10-18 18:32:27.702,20334,0.864754,68.0,10.0,6.0,0.0,12650.0,16650.0,1140018.0,1144018.0
10,1002-UM,UM,48,2013-10-18 18:31:47.116,2013-10-18 18:32:07.368,20252,0.846914,70.0,11.0,2.0,1.0,12650.0,16650.0,1119766.0,1123766.0


At this point we have cleaned what appears to be the experimental metadata somewhat and discovered some information
about the trials / segments, participants, and experiment timing.  Let's designate the data frame at this point as df_clean1, where
the metadata has been cleaned.

In [53]:
# designate cleaned experiment metadata so far as df_clean1
df_clean1 = df.copy()

num_rows, num_features = df.shape
num_participants = df.participant_id.unique().size
num_segments = df.segment_index.unique().size

print('Number of experimental trial/segment rows: ', num_rows)
print('Current number of feature columns: ', num_features)
print('Number of unique participants: ', num_participants)
print('Maximum number of trial/segments for a participant: ', num_segments)

Number of experimental trial/segment rows:  4076
Current number of feature columns:  123
Number of unique participants:  135
Maximum number of trial/segments for a participant:  57


# Data Exploration and Cleaning of Target Label for Binary Classification

As we have discovered, there is not exactly a single column that is simply 0/1 for focused / mind wandering.  The following columns contain information about
the potential label we need for training.

There is a NumberOfReports, which we believe is the number of mind wanderings reported by the participant in the segment.
There are then a number of FirstReportX, that seem to record the type and time of the first such report of mind wandering.

Let's look at these.  It is not clear whether FirstReportType or the others might be useful as features.  We may remove these, or move them to a separate
data frame.  We would like to get a clean series of 0/1 results for training a binary classifier on this data.

In [54]:
# make a copy of the clean1 data frame for the following cleaning and exploration
df = df_clean1.copy()

In [55]:
# personal preference, I prefer using regular _ feature/variable names.  So lets rename the columns we want, and pull out only
# those columns into a new temporary dataframe for this section.
report_columns_map = {
    'NumberOfReports': 'number_of_reports',
    'FirstReportType': 'first_report_type',
    'FirstReportContent': 'first_report_content',
    'FirstReportTimestamp': 'first_report_timestamp',
    'FirstReportTimesGMT': 'first_report_times_GMT',
    'FirstReportSessionTime(ms)': 'first_report_session_time',
    'FirstReportTrialTime(ms)': 'first_report_trial_time',
    'FirstReportSegmentTime(ms)': 'first_report_segment_time',
}

df.rename(columns = report_columns_map, inplace=True)

In [56]:
# we'll extract a separate dataframe to only work on the report/output features
report_columns = ['number_of_reports', 'first_report_type', 'first_report_content', 'first_report_timestamp', 
                  'first_report_times_GMT', 'first_report_session_time', 'first_report_trial_time', 
                  'first_report_segment_time']
df_label = df[report_columns].copy()

df_label.head(20)

Unnamed: 0,number_of_reports,first_report_type,first_report_content,first_report_timestamp,first_report_times_GMT,first_report_session_time,first_report_trial_time,first_report_segment_time
1,0.0,none,none,,,,,
2,0.0,none,none,,,,,
3,0.0,none,none,,,,,
4,0.0,none,none,,,,,
5,1.0,self-caught,other,1382120000000.0,,1219182.0,1219182.0,11508.0
6,0.0,none,none,,,,,
7,0.0,none,none,,,,,
8,0.0,none,none,,,,,
9,0.0,none,none,,,,,
10,0.0,none,none,,,,,


Some observations of this data.  
- We will look at the number of reports to find the unique values.  
- A lot of empty data here.  It appears that only self-caught report types have time stamps.
- First report times GMT may be completely empty
- First report session time and trial time appear to be duplicates, should verify and drop.
- We can probably create a valid datetimestamp if we want from the timestamp and session time.
- It is not clear what the difference between session time and segment time is here.  It looks like segment time is somewhat smaller than
  the trial_length.  I'm guessing this is the time of the report relative to the start of the trial / segment?
- We should look at the first report type and first report content.  They look like categorical variables.

First let's look at the unique values for each of the report counts and labels.

In [57]:
# unique values for NumberOfReports
df_label.number_of_reports.unique()

array([0., 1., 2., 4., 3., 5.])

In [58]:
# unique values of first report type
df_label.first_report_type.unique()

array(['none', 'self-caught'], dtype=object)

In [59]:
# unique values in first_report_content
df_label.first_report_content.unique()

array(['none', 'other', 'task-related'], dtype=object)

First of all, number of reports can be 0, but can go up to 5.

Also, not all self-caught reports have a content of other.  But do only self-caught reports have a content, i.e. is every self-caught
report either other or task-related?

As mentioned before, we believe that number of reports is basically our binary label here.  If it is 0, then there was no report (self-caught or
otherwise) of mind wandering for the trial, so we want to predict 0 or focused.  If it is non zero, there were 1 or more incidents of mind wandering
during the trial, and we want to predict 1 for mind wandering.

Let's convert the number of reports to an integer and then get a count of each value to get a feel for how often 0 and non zero number of reports
occur.

In [60]:
df_label.number_of_reports = df_label.number_of_reports.astype(int)
df_label.number_of_reports.value_counts()

0    2963
1     980
2     112
3      16
4       3
5       2
Name: number_of_reports, dtype: int64

In [61]:
# get number of nonzero
x = df_label.number_of_reports.value_counts()
x[x.index >= 1].sum()

1113

So this seems logical.  Number of reports decreases in frequency pretty fast.  We have 2963 trials without a report of mind wandering (if our interpretation
is correct) and 1113 with a report of mind wandering.  Thus about 27% of trials are positive, and 73% are negative labels.  This is a bit of a skewed data set
if this interpretation is correct.
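
The class balance follows directly from the value counts shown above.  A short sketch of the arithmetic, with the counts copied from that output:

```python
# Value counts of number_of_reports, copied from the output above
counts = {0: 2963, 1: 980, 2: 112, 3: 16, 4: 3, 5: 2}

total = sum(counts.values())
positive = sum(n for reports, n in counts.items() if reports >= 1)
negative = counts[0]

positive_share = positive / total   # fraction of trials with a report
ratio = negative / positive         # negatives per positive

print(total, positive, round(positive_share, 3), round(ratio, 2))
```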

Moving on to the other two columns.  Let's get the value counts of the report type and content, and let's see whether none for the report type always means
none for the content.

In [62]:
df_label.first_report_type.value_counts()

none           2963
self-caught    1113
Name: first_report_type, dtype: int64

In [63]:
df_label.first_report_content.value_counts()

none            2963
other            620
task-related     493
Name: first_report_content, dtype: int64

In [64]:
# See if there are rows where if type is none the content is not none
sum( (df_label.first_report_type == 'none') & (df_label.first_report_content != 'none') )

0

In [65]:
# and vice-versa
sum( (df_label.first_report_type != 'none') & (df_label.first_report_content == 'none') )

0

In [66]:
# also, make sure that for all trials, when number of reports is 0 then the report type is none
sum( (df_label.number_of_reports == 0) & (df_label.first_report_type != 'none') )

0

In [67]:
# and finally confirm that when number of reports is 1 or greater, then the report type is always something other than none
sum( (df_label.number_of_reports >= 1) & (df_label.first_report_type == 'none') )

0

That looks pretty definitive to me.  There are 2963 trials where mind wandering was not reported.  There are 1113 trials where 1 or more instances of mind wandering were reported.  All trials with no mind wandering have the report type and report content recorded as none.  All of the 1113 trials that had 1 or
more occurrences of mind wandering have a report type of self-caught and a content of either task-related or other.
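
The four boolean sums above can also be condensed into one cross-tabulation plus a single consistency check.  This is a sketch on made-up rows rather than the real data, so the counts it prints are illustrative only.

```python
import pandas as pd

# Toy rows with the same three label columns as df_label
toy = pd.DataFrame({
    'number_of_reports': [0, 0, 1, 2, 1],
    'first_report_type': ['none', 'none', 'self-caught', 'self-caught', 'self-caught'],
    'first_report_content': ['none', 'none', 'other', 'task-related', 'other'],
})

# Rows are report types, columns are report contents, cells are trial counts
table = pd.crosstab(toy.first_report_type, toy.first_report_content)
print(table)

# True when "has a report" and "type is not none" agree on every row
consistent = ((toy.number_of_reports > 0) == (toy.first_report_type != 'none')).all()
print(consistent)
```

Any non-zero count in an off-pattern cell, such as type none with content other, would flag an inconsistent row.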

Some conclusions:

- Could try and build a classifier that detects mind wandering, and then whether the instance was task-related or some other type of mind wandering.
- We can use either the number_of_reports feature or the first_report_type to create a numeric binary variable for 0=focused / 1=mind wandering.
- Data is a bit skewed, with a ratio of roughly 2.7 to 1 of negative labels to positive labels.

Let's do the following:

- We will add a column called mind_wandered_label, which will be a boolean value: False when number_of_reports == 0 (equivalently first_report_type == none) and True when number_of_reports >= 1 (equivalently first_report_type == self-caught).
- We will leave in the number_of_reports and report type and content features on this df_label dataframe.  Can always extract the mind_wandered_label
  label column if needed.
- We will consolidate the timestamps to create a valid datetime value as before, and drop any unneeded time information.
- We will then remove all of these columns from a copy of df_clean1, and give the result the new name df_clean2.

In [68]:
# lets first clean the time information for our report/output dataframe

# drop the first_report_times_GMT column as all values are NaN
# sum( df_label.first_report_times_GMT.isna() ) confirms this
df_label.drop(['first_report_times_GMT'], axis=1, inplace=True)

In [69]:
# confirm that all first_report_session_time and first_report_trial_time are duplicates, and if so drop the session time column
# we get 1113 because only the 1113 mind wandering trials have a time, and if they are equal for both columns then all are duplicates
sum( df_label.first_report_session_time.dropna() == df_label.first_report_trial_time.dropna() )

1113
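
The comparison above works because both columns are missing on exactly the same rows.  A check that does not depend on that is `Series.equals`, which treats NaN values in the same positions as equal.  A sketch on toy values (the two non-missing numbers are taken from rows shown earlier in this notebook):

```python
import numpy as np
import pandas as pd

# Toy versions of the two columns; NaN marks trials without a report
session_time = pd.Series([np.nan, 1219182.0, np.nan, 1733638.0])
trial_time = pd.Series([np.nan, 1219182.0, np.nan, 1733638.0])

# True only if every position matches, with NaN/NaN counted as a match
same = session_time.equals(trial_time)

# Elementwise variant that spells out the NaN handling
both_missing = session_time.isna() & trial_time.isna()
same_elementwise = ((session_time == trial_time) | both_missing).all()

# A single changed value makes the columns differ
changed = trial_time.copy()
changed.iloc[1] = 0.0
differs = not session_time.equals(changed)

print(same, bool(same_elementwise), differs)
```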

In [70]:
# so drop the first report session time
df_label.drop(['first_report_session_time'], axis=1, inplace=True)

In [71]:
# now use the two columns to create a valid datetime value, and replace the trial time field with the datetime
df_label.first_report_trial_time = pd.to_datetime(df_label.first_report_timestamp + df_label.first_report_trial_time, unit='ms')

In [72]:
# finally drop the no longer needed timestamp feature
df_label.drop(['first_report_timestamp'], axis=1, inplace=True)

In [73]:
df_label.head(20)

Unnamed: 0,number_of_reports,first_report_type,first_report_content,first_report_trial_time,first_report_segment_time
1,0,none,none,NaT,
2,0,none,none,NaT,
3,0,none,none,NaT,
4,0,none,none,NaT,
5,1,self-caught,other,2013-10-18 18:33:39.182000128,11508.0
6,0,none,none,NaT,
7,0,none,none,NaT,
8,0,none,none,NaT,
9,0,none,none,NaT,
10,0,none,none,NaT,
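
The trailing `000128` nanoseconds in row 5 look like a floating point artifact of converting float milliseconds.  One possible cleanup, sketched here on toy values and not applied to df_label in this notebook, is to round the result to the millisecond:

```python
import pandas as pd

# Toy version of the two report columns; NaN marks trials without a report
timestamp = pd.Series([float('nan'), 1382120000000.0])
trial_time = pd.Series([float('nan'), 1219182.0])

# Convert the float sum to datetimes, then round away sub-millisecond noise
report_time = pd.to_datetime(timestamp + trial_time, unit='ms').dt.round('ms')
print(report_time)
```

Rows without a report stay NaT, since rounding leaves missing values untouched.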


In [74]:
# now add on the mind_wandered_label.  Should be able to use either the number of reports or the report type to do this
# lets just double check that assumption one more time
label1 = df_label.number_of_reports >= 1
label2 = df_label.first_report_type == 'self-caught'

# if all current 4076 trials end up with same label, then both methods are the same here
sum(label1 == label2)

4076

In [75]:
# add the mind_wandered_label
df_label['mind_wandered_label'] = (df_label.first_report_type == 'self-caught')

In [76]:
df_label.head(20)

Unnamed: 0,number_of_reports,first_report_type,first_report_content,first_report_trial_time,first_report_segment_time,mind_wandered_label
1,0,none,none,NaT,,False
2,0,none,none,NaT,,False
3,0,none,none,NaT,,False
4,0,none,none,NaT,,False
5,1,self-caught,other,2013-10-18 18:33:39.182000128,11508.0,True
6,0,none,none,NaT,,False
7,0,none,none,NaT,,False
8,0,none,none,NaT,,False
9,0,none,none,NaT,,False
10,0,none,none,NaT,,False


In [77]:
# summary information about the df_label data frame
num_rows, num_features = df_label.shape
num_negative_labels = sum(df_label.mind_wandered_label == False)
num_positive_labels = sum(df_label.mind_wandered_label == True)

print('Number of trials in the label dataframe: ', num_rows)
print('Number of features in the label dataframe: ', num_features)
print('Number of negative labels: ', num_negative_labels)
print('Number of positive labels: ', num_positive_labels)

Number of trials in the label dataframe:  4076
Number of features in the label dataframe:  6
Number of negative labels:  2963
Number of positive labels:  1113


In [78]:
# finally, lets remove all of these report columns from the cleaned data frame
# we will designate this as df_clean2, which will have both experiment meta features cleaned, and columns associated with the output label removed
df_clean2 = df_clean1.copy()
df_clean2.rename(columns = report_columns_map, inplace=True)
df_clean2.drop(report_columns, axis=1, inplace=True)

In [79]:
num_rows, num_features = df_clean2.shape
num_participants = df_clean2.participant_id.unique().size
num_segments = df_clean2.segment_index.unique().size

print('Number of experimental trial/segment rows: ', num_rows)
print('Current number of feature columns: ', num_features)
print('Number of unique participants: ', num_participants)
print('Maximum number of trial/segments for a participant: ', num_segments)

Number of experimental trial/segment rows:  4076
Current number of feature columns:  115
Number of unique participants:  135
Maximum number of trial/segments for a participant:  57


# Data Exploration and Cleaning of Basic Features

We currently have 115 feature columns.  The first 6 columns are experimental metadata about participant, trial/segment number, date and time of trial, length,
etc.  Next we are going to dig a bit deeper into the features and results reported in paper 2, and see if we have all of the same ones.  One thing
to note here: in paper 2 they reported using eye gaze data for 132 participants out of the original 140.  We seem to have 135 unique participants
in this data set.  It is unclear what accounts for the 5 missing from the original 140, or the 3 extra relative to the paper, but the numbers are similar, so I'm assuming they trimmed
3 more participants for some reason for that paper.

As noted in previous work by my student, paper 2 (Machine learning to build the model section) gives the following information about features:

- 4 sets of global features for each window
  1. eye movement descriptive features (48 features mentioned in paper)
  2. pupil diameter descriptive features
  3. blink features: number of blinks and mean blink duration
  4. miscellaneous gaze properties: number of saccades, horizontal saccade proportion, fixation dispersion and fixation duration / saccade duration ratio.
  
For each they computed the minimum, maximum, mean, median, standard deviation, skew, kurtosis and range. The paper says there were 62 global gaze features
computed (48 of which are eye movement descriptives).

They removed the mean blink duration because it had missing values for more than 10% of the instances.

They removed 29 features with a variance inflation factor greater than 5.
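
For reference, the variance inflation factor of feature j is usually defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features.  The sketch below is one plain NumPy way to compute it on synthetic data.  It is not taken from the paper, whose exact procedure I have not verified, and the function name is hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X, using an intercept in each regression."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        # Least squares fit of column j on the remaining columns
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic example: c is nearly a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print([round(v, 1) for v in vifs])
```

With a threshold of 5, columns a and c would be flagged as collinear while b would be kept.  Perfectly collinear columns would give a division by zero here, so real data may need a guard.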


Looking at the data we currently have, the first 6 columns are the experiment metadata, leaving 109 feature columns (115 - 6), which seems like a bit
more than they are describing in paper 2.

Here is the list of the current 109 potential features:

In [80]:
df = df_clean2.copy()

In [81]:
for idx, col in enumerate(df.columns[6:]):
    #print('%02d: %s'% (idx+6, col))
    print("    '%s': 'x'," % (col))

    'ValidityRate': 'x',
    'PageFixations': 'x',
    'WindowFixations': 'x',
    'PageBlinks': 'x',
    'WindowBlinks': 'x',
    'BottomWindowBound_Page': 'x',
    'TopWindowBound_Page': 'x',
    'BottomWindowBound_Session': 'x',
    'TopWindowBound_Session': 'x',
    'FixDurN': 'x',
    'FixDurMed': 'x',
    'FixDurMean': 'x',
    'FixDurSD': 'x',
    'FixDurMin': 'x',
    'FixDurMax': 'x',
    'FixDurRange': 'x',
    'FixDurSkew': 'x',
    'FixDurKur': 'x',
    'FxDisp': 'x',
    'SacDurN': 'x',
    'SacDurMed': 'x',
    'SacDurMean': 'x',
    'SacDurSD': 'x',
    'SacDurMin': 'x',
    'SacDurMax': 'x',
    'SacDurRange': 'x',
    'SacDurSkew': 'x',
    'SacDurKur': 'x',
    'SacAmpN': 'x',
    'SacAmpMed': 'x',
    'SacAmpMean': 'x',
    'SacAmpSD': 'x',
    'SacAmpMin': 'x',
    'SacAmpMax': 'x',
    'SacAmpRange': 'x',
    'SacAmpSkew': 'x',
    'SacAmpKur': 'x',
    'SacAngAbsN': 'x',
    'SacAngAbsMed': 'x',
    'SacAngAbsMean': 'x',
    'SacAngAbsSD': 'x',
    'SacAngAbsMin':

So the features named FixDurX, SacDurX, SacAmpX, SacVelX, SacAngRelX and SacAngAbsX definitely correspond to the 6 eye movement descriptive measures
mentioned: fixation duration, saccade duration, saccade amplitude, saccade velocity, and relative and absolute saccade angle.  Each of these
6 measurements had 8 statistics calculated over the trial window (minimum, maximum, mean, median, standard deviation, skew, kurtosis
and range), yielding a total of $6 \times 8 = 48$ eye movement descriptive features.

Let's pull out these first 48 features into a new data frame we'll designate df_features, and try to identify the 62 features mentioned in the paper.
While we are at it, we will normalize the names, check for missing values and make sure that the data types look correct.  We will also remove
these from our temporary df dataframe as we add and work on them, to make it easier to identify leftover features after we pull out the ones
listed in the paper.

There is also a MeasureN feature for each measure, which is probably the sample size of the measurements used to calculate the median, mean, etc.  They did
not appear to use the MeasureN values in their data, so let's drop these for now.

There is a FxDisp measure in the list of features between the fixation duration and the saccade duration.  It is unclear yet what this
measurement is.

In [82]:
eye_movement_descriptive_features_map = {
    'FixDurMed':      'fixation_duration_median',
    'FixDurMean':     'fixation_duration_mean',
    'FixDurSD':       'fixation_duration_standard_deviation',
    'FixDurMin':      'fixation_duration_minimum',
    'FixDurMax':      'fixation_duration_maximum',
    'FixDurRange':    'fixation_duration_range',
    'FixDurSkew':     'fixation_duration_skew',
    'FixDurKur':      'fixation_duration_kurtosis',
    'SacDurMed':      'saccade_duration_median',
    'SacDurMean':     'saccade_duration_mean',
    'SacDurSD':       'saccade_duration_standard_deviation',
    'SacDurMin':      'saccade_duration_minimum',
    'SacDurMax':      'saccade_duration_maximum',
    'SacDurRange':    'saccade_duration_range',
    'SacDurSkew':     'saccade_duration_skew',
    'SacDurKur':      'saccade_duration_kurtosis',
    'SacAmpMed':      'saccade_amplitude_median',
    'SacAmpMean':     'saccade_amplitude_mean',
    'SacAmpSD':       'saccade_amplitude_standard_deviation',
    'SacAmpMin':      'saccade_amplitude_minimum',
    'SacAmpMax':      'saccade_amplitude_maximum',
    'SacAmpRange':    'saccade_amplitude_range',
    'SacAmpSkew':     'saccade_amplitude_skew',
    'SacAmpKur':      'saccade_amplitude_kurtosis',
    'SacVelMed':      'saccade_velocity_median',
    'SacVelMean':     'saccade_velocity_mean',
    'SacVelSD':       'saccade_velocity_sd',
    'SacVelMin':      'saccade_velocity_min',
    'SacVelMax':      'saccade_velocity_max',
    'SacVelRange':    'saccade_velocity_range',
    'SacVelSkew':     'saccade_velocity_skew',
    'SacVelKur':      'saccade_velocity_kurtosis',
    'SacAngAbsMed':   'saccade_angle_absolute_median',
    'SacAngAbsMean':  'saccade_angle_absolute_mean',
    'SacAngAbsSD':    'saccade_angle_absolute_standard_deviation',
    'SacAngAbsMin':   'saccade_angle_absolute_minimum',
    'SacAngAbsMax':   'saccade_angle_absolute_maximum',
    'SacAngAbsRange': 'saccade_angle_absolute_range',
    'SacAngAbsSkew':  'saccade_angle_absolute_skew',
    'SacAngAbsKur':   'saccade_angle_absolute_kurtosis',
    'SacAngRelMed':   'saccade_angle_relative_median',
    'SacAngRelMean':  'saccade_angle_relative_mean',
    'SacAngRelSD':    'saccade_angle_relative_standard_deviation',
    'SacAngRelMin':   'saccade_angle_relative_minimum',
    'SacAngRelMax':   'saccade_angle_relative_maximum',
    'SacAngRelRange': 'saccade_angle_relative_range',
    'SacAngRelSkew':  'saccade_angle_relative_skew',
    'SacAngRelKur':   'saccade_angle_relative_kurtosis',
}

In [83]:
# rename columns in temporary dataframe
df.rename(columns = eye_movement_descriptive_features_map, inplace=True)

In [84]:
# make a copy of only these feature columns, so we can start creating a new dataframe of only the features we are interested in
df_features = df[list(eye_movement_descriptive_features_map.values())].copy()
df_features.shape

(4076, 48)

In [85]:
# now remove these features from the temporary data frame, to make it easier to see what is left over after extracting mentioned
# features from paper.
df.drop(eye_movement_descriptive_features_map.values(), axis=1, inplace=True)

At this point we have pulled out the first 48 features identified as the eye movement descriptive features, and we removed these columns from the
temporary dataframe.

Let's ensure that the datatypes look reasonable, and see if we need to deal with any missing data.

These all appear to be clean.  No missing data.  Durations are in ms here, so a mean fixation duration of about 1/4 of a second (244 ms) seems correct,
with the largest per-trial mean just over 1/2 of a second (597 ms).  Again we might want to document these in a data dictionary.  From the paper,
durations are measured in ms, and amplitude is measured in number of pixels between two fixations.  Velocity was calculated as amplitude divided by
duration, so it should be in pixels / ms.  Since this is a derived feature, it should be fairly correlated with the measures it is derived from.
The angles look like they are in degrees, as values range from close to 0 up to about 360.  It is not completely clear to me what the difference
between the relative and absolute angle of the saccade is here.  The means and percentiles are close, but they definitely look like
different measures.
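
A compact way to check data types and missing values together is sketched below.  The helper `summarize_quality` and the toy data frame are hypothetical stand-ins (they are not part of this notebook's data), but the same call could be pointed at df_features.

```python
import numpy as np
import pandas as pd

def summarize_quality(frame):
    """Report the dtype and missing-value count for every column of frame."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'n_missing': frame.isna().sum(),
    })

# toy stand-in for df_features; the values are made up for illustration
toy = pd.DataFrame({
    'fixation_duration_mean': [244.0, 231.5, 260.2],
    'saccade_duration_mean': [65.3, np.nan, 44.8],
})
quality = summarize_quality(toy)
```

Any column with a nonzero `n_missing`, or a dtype other than the expected float, would then stand out in a single table.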

In [86]:
df_features.describe()

Unnamed: 0,fixation_duration_median,fixation_duration_mean,fixation_duration_standard_deviation,fixation_duration_minimum,fixation_duration_maximum,fixation_duration_range,fixation_duration_skew,fixation_duration_kurtosis,saccade_duration_median,saccade_duration_mean,saccade_duration_standard_deviation,saccade_duration_minimum,saccade_duration_maximum,saccade_duration_range,saccade_duration_skew,saccade_duration_kurtosis,saccade_amplitude_median,saccade_amplitude_mean,saccade_amplitude_standard_deviation,saccade_amplitude_minimum,saccade_amplitude_maximum,saccade_amplitude_range,saccade_amplitude_skew,saccade_amplitude_kurtosis,saccade_velocity_median,saccade_velocity_mean,saccade_velocity_sd,saccade_velocity_min,saccade_velocity_max,saccade_velocity_range,saccade_velocity_skew,saccade_velocity_kurtosis,saccade_angle_absolute_median,saccade_angle_absolute_mean,saccade_angle_absolute_standard_deviation,saccade_angle_absolute_minimum,saccade_angle_absolute_maximum,saccade_angle_absolute_range,saccade_angle_absolute_skew,saccade_angle_absolute_kurtosis,saccade_angle_relative_median,saccade_angle_relative_mean,saccade_angle_relative_standard_deviation,saccade_angle_relative_minimum,saccade_angle_relative_maximum,saccade_angle_relative_range,saccade_angle_relative_skew,saccade_angle_relative_kurtosis
count,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0
mean,219.403944,244.034112,108.787621,122.50368,476.429342,351.639537,0.996244,1.298634,29.32571,65.326965,87.81747,11.512267,264.664902,258.862365,1.956464,4.18578,163.288298,245.568984,234.530286,69.223759,802.600796,735.131763,1.999277,4.302593,8.14074,8.244053,4.032931,2.051137,14.9395,12.888364,0.132923,0.272021,201.600177,194.144413,144.684971,7.241965,356.743806,349.965334,-0.1564,-1.211052,178.106495,178.970778,175.515431,2.961315,356.98809,353.783332,0.019102,-2.126824
std,45.870434,53.792348,62.189677,28.308532,189.840517,176.113809,0.795421,2.547685,25.710168,60.323696,106.662368,7.270993,262.022751,281.259216,0.858751,3.910018,50.654266,66.202164,71.000667,25.008923,205.177156,200.154889,0.854413,3.879658,3.114519,2.777068,1.280407,2.304753,4.225646,4.226886,0.816272,1.646024,94.823896,47.457778,18.510662,27.404695,11.97648,27.534653,0.627277,0.989999,147.376798,40.073046,11.249984,9.928564,10.321726,16.676733,0.629899,1.207821
min,117.0,127.142857,24.36938,83.0,175.0,66.0,-2.449136,-2.879022,8.0,9.4,0.377964,8.0,16.0,1.0,-2.645751,-3.333333,51.02439,57.246533,11.998179,0.952858,79.352713,37.386633,-2.329558,-3.306244,0.290926,0.66161,0.50904,0.002338,1.436354,1.276716,-2.329386,-3.294416,0.822356,26.512658,66.120587,0.0,185.675382,169.2308,-2.550109,-3.333013,1.524554,1.970569,8.129163,0.000245,22.18636,2.206239,-2.448342,-5.999967
25%,187.0,207.13125,69.185654,100.0,350.0,233.0,0.447695,-0.55931,17.0,30.1,26.531383,8.0,91.0,83.0,1.376895,0.989331,128.826568,201.686293,206.242848,57.060127,761.465834,690.56319,1.487635,1.23415,5.440435,5.850329,3.127325,0.461505,12.093436,9.854592,-0.42688,-0.881009,173.152773,162.108689,135.095013,0.484611,357.393788,353.994935,-0.540932,-1.805117,14.692647,153.054848,171.571887,0.527362,356.822727,353.430566,-0.373934,-2.540045
50%,209.0,233.5,92.849751,117.0,433.0,308.0,0.921989,0.49318,21.0,44.846154,54.285811,8.0,183.0,167.0,1.991773,3.741482,156.354574,242.451385,243.289083,67.393291,855.925005,783.316331,2.035969,3.432282,8.868506,8.663045,3.993678,0.970296,15.40155,12.932225,0.092462,-0.078608,177.51338,194.799032,147.04907,1.260044,358.823241,356.983735,-0.160512,-1.436767,179.62838,179.929086,177.899605,1.392556,358.636418,356.630884,0.000134,-2.282108
75%,242.0,268.75,128.551158,141.0,549.0,417.0,1.514626,2.496111,33.0,75.827922,100.994566,16.0,327.0,317.0,2.604248,7.174066,188.111155,285.168831,277.825728,81.837365,923.721106,858.40243,2.662692,7.555701,10.586773,10.49442,4.896642,3.004345,17.736699,15.8613,0.658521,1.06051,264.082083,227.076246,157.396156,2.976306,359.537315,358.482376,0.226959,-0.831148,345.431314,205.705618,181.892834,3.21926,359.469253,358.28037,0.377163,-2.046081
max,529.0,597.166667,609.742276,283.0,1925.0,1413.022926,3.6887,11.852581,400.0,568.2,1071.196854,91.0,2267.0,2400.0,3.988124,15.931546,790.690217,746.640449,478.892265,255.660528,1273.423648,1180.285433,3.85856,15.493315,16.598825,16.120969,10.397077,12.102106,40.141742,37.763217,2.969283,9.243417,358.909602,335.644424,194.792277,189.677904,359.999117,359.982688,2.397787,5.312167,358.831683,355.401129,204.797658,345.491279,359.999694,359.970053,2.449345,5.997423


Let's then continue on with the other features mentioned in paper 2.  The second set were described as pupil diameter descriptive features.
I think from the description these are simply the 8 statistical summaries of the pupil diameter measurement that we should expect here.
These appear to be the PupilDiametersZX features, as the paper mentions that a z-score of pupil diameter was calculated in order
to standardize this measure.

Let's normalize the names of these parameters, and pull them from the temporary df into df_features.  And as before, let's check for
missing values and that data types look correct.  This brings the number of features we have extracted to 56.  Also, we have noted
that there is a count N for each of these 7 measures, so we have identified 7 features so far that we are not going to extract
into the features we use for this analysis and replication.

In [87]:
pupil_diameter_descriptive_features_map = {
    'PupilDiametersZMed':   'pupil_diameter_median',
    'PupilDiametersZMean':  'pupil_diameter_mean',
    'PupilDiametersZSD':    'pupil_diameter_standard_deviation',
    'PupilDiametersZMin':   'pupil_diameter_minimum',
    'PupilDiametersZMax':   'pupil_diameter_maximum',
    'PupilDiametersZRange': 'pupil_diameter_range',
    'PupilDiametersZSkew':  'pupil_diameter_skew',
    'PupilDiametersZKur':   'pupil_diameter_kurtosis',
}

In [88]:
# rename these columns in temporary dataframe
df.rename(columns = pupil_diameter_descriptive_features_map, inplace=True)

In [89]:
# append these feature columns to our current df_features
#df_features.join(df)
df_features = df_features.join(df[pupil_diameter_descriptive_features_map.values()].copy())
df_features.shape

(4076, 56)

In [90]:
# now remove these features from the temporary data frame, to make it easier to see what is left over after extracting mentioned
# features from paper.
df.drop(pupil_diameter_descriptive_features_map.values(), axis=1, inplace=True)
df.shape

(4076, 59)

And as before, let's check these new 8 features for pupil diameter to see if there are any missing values, etc.

Since pupil diameter is a normalized z-score, these don't really have a unit.  The original unit was probably mm, but a z-score is expressed in
standard deviations.

I was a little surprised at first to see negative values for the median and mean, but that is expected for z-scores: a negative value just means the
measurement is below the average it was standardized against.  The values mostly range from about -3 to +3, so these look roughly like a normal
distribution centered near 0.  Z-scores around 0 here indicate pupil diameters that were average across those observed.

In any case, there are also no missing items here, and we need a float type for these data, so these features look correct.
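
To illustrate why negative z-scores are normal, here is a small sketch with made-up pupil diameters.  The values are hypothetical, and which population the paper standardized over (each participant, or all participants) is not something this example assumes.

```python
import numpy as np

# hypothetical raw pupil diameters (arbitrary units)
diameters = np.array([3.1, 3.4, 2.9, 3.8, 3.3])

# z-score: subtract the mean, then divide by the standard deviation
z = (diameters - diameters.mean()) / diameters.std()
```

The standardized values have mean 0 and standard deviation 1, so any measurement below the mean comes out negative.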

In [91]:
df_features[pupil_diameter_descriptive_features_map.values()].describe()

Unnamed: 0,pupil_diameter_median,pupil_diameter_mean,pupil_diameter_standard_deviation,pupil_diameter_minimum,pupil_diameter_maximum,pupil_diameter_range,pupil_diameter_skew,pupil_diameter_kurtosis
count,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0
mean,-0.315771,-0.311956,0.397219,-1.135466,0.557059,1.692525,0.073302,-0.235972
std,0.849375,0.835383,0.194162,0.887844,0.949528,0.751333,0.649009,1.33988
min,-3.384376,-3.161523,0.031728,-4.553955,-2.627697,0.099516,-3.230114,-2.494463
25%,-0.907863,-0.889797,0.259026,-1.749153,-0.055583,1.170338,-0.300039,-0.99236
50%,-0.341211,-0.340486,0.360294,-1.150425,0.512813,1.572328,0.065047,-0.554679
75%,0.239795,0.230524,0.494064,-0.541892,1.116296,2.090335,0.44566,0.088549
max,2.903637,3.209197,1.577418,2.748382,7.562523,8.11102,3.395078,14.183782


Next described were the blink features.  It looks like only 2 measures were used here, the number of blinks and the
mean blink duration.  This makes sense because unlike the others, a blink is a point event, so they just counted the number of blinks.
Though since there was a measure of blink duration, they could have calculated the median, standard deviation, etc. for the blink duration.

Actually there are mean, median, etc. columns for BlinkDurX, and a BlinkDurN.  If we include all of these, that will bring the total number of
features up to 65.  I am not sure why there is a discrepancy here.  I assume that if the reported number of 62 features is correct, they only used
the number of blinks and the mean blink duration.  So we have a choice to either use just the 2 that seem to have been indicated, or include
all of the blink duration measures.

I am going to include all of the blink duration measures here, including BlinkDurN, which should be the number of blinks.  This will
add 9 features to our features data frame, bringing the total so far up to 65, with still one more group to go.  But we can still easily replicate the
feature set as described in paper 2 by dropping the other blink duration features from df_features if needed later.

In [92]:
blink_features_map = {
    'BlinkDurN':     'number_of_blinks',
    'BlinkDurMed':   'blink_duration_median',
    'BlinkDurMean':  'blink_duration_mean',
    'BlinkDurSD':    'blink_duration_standard_deviation',
    'BlinkDurMin':   'blink_duration_minimum',
    'BlinkDurMax':   'blink_duration_maximum',
    'BlinkDurRange': 'blink_duration_range',
    'BlinkDurSkew':  'blink_duration_skew',
    'BlinkDurKur':   'blink_duration_kurtosis',
}

In [93]:
# rename these columns in temporary dataframe
df.rename(columns = blink_features_map, inplace=True)

In [94]:
# append these feature columns to our current df_features
#df_features.join(df)
df_features = df_features.join(df[blink_features_map.values()].copy())
df_features.shape

(4076, 65)

In [95]:
# now remove these features from the temporary data frame, to make it easier to see what is left over after extracting mentioned
# features from paper.
df.drop(blink_features_map.values(), axis=1, inplace=True)
df.shape

(4076, 50)

And let's examine these new features to see if any values are missing, and if we need any other cleaning.

In [96]:
df_features[blink_features_map.values()].describe()

Unnamed: 0,number_of_blinks,blink_duration_median,blink_duration_mean,blink_duration_standard_deviation,blink_duration_minimum,blink_duration_maximum,blink_duration_range,blink_duration_skew,blink_duration_kurtosis
count,4076.0,1816.0,1816.0,625.0,1816.0,1816.0,1816.0,182.0,40.0
mean,0.654503,161.071861,161.535169,31.901193,153.186123,170.496145,17.310022,0.20901,0.702644
std,0.888403,63.798246,63.641444,34.485271,63.561805,69.886898,39.826196,1.305798,2.866314
min,0.0,83.0,83.0,0.0,83.0,83.0,0.0,-1.732051,-6.0
25%,0.0,117.0,117.0,11.313709,108.0,125.0,0.0,-1.106108,-1.04876
50%,0.0,146.0,146.0,22.181073,134.0,150.0,0.0,0.405934,1.247536
75%,1.0,183.5,184.0,36.363902,175.0,200.0,17.0,1.466667,2.885738
max,6.0,400.0,400.0,212.132034,400.0,400.0,300.0,2.236068,5.0


There is a lot of missing data here.  For one, the mean number of blinks is less than 1.  So most likely the data is missing in all cases where the
number of blinks is 0, since you cannot calculate a mean, etc. of 0 values.  I was a bit surprised, expecting more blinks.  But the eye tracking
summary statistics cover on average only 10 to 15 seconds or so: the time from the start of the page until 2 seconds before mind wandering was reported
(or the average of this time for the trials without mind wandering).  It is probably not unusual to have only 1 or 2 blinks, and many times none,
in a period of 10 seconds or so.

Let me test that hypothesis.  I think that the count where number_of_blinks is 0 will end up being $4076 - 1816 = 2260$.  This is because when the number
of blinks is 0 we can't calculate a mean and other statistics.  Conversely, the number of trials with 1 or more blinks should be 1816.

In [97]:
# whenever no blink, should end up with NaN results for statistics, so expect 2260 trials with no blinks here
len( df_features[df_features.number_of_blinks == 0])

2260

In [98]:
# but number of trials where there are 1 or more blink can have a mean.  Though standard deviation isn't really meaningful for 1 value, so maybe this is
# why sd (and also skew and kurtosis) have even more missing values
len( df_features[df_features.number_of_blinks >= 1] )

1800

Ok, the assumption is mostly correct, but I was expecting 1816 trials with 1 or more blinks.  Number of blinks is a float type, so are there some
recorded numbers of blinks that are fractional, between 0 and 1?  Those turn out to be the missing ones.  Is that bad data?

Let's look at those 16 trials.  Yes, that appears to be the case: all 16 of these have somehow gotten a fractional number_of_blinks.
But looking at the median, mean, minimum and maximum, it is clear these must have been trials with 1 blink.  The standard
deviation is NaN here because it can't be calculated from a single sample.  And the minimum and maximum are always the same, which means only a single
observation.
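
The pattern of missing values can be reproduced with a small sketch.  It assumes the original statistics behave like the pandas defaults, which need at least 1 value for a mean, 2 for a standard deviation, 3 for skew and 4 for kurtosis.  That would be consistent with the shrinking counts in the table above (1816, 625, 182 and 40).  The helper `blink_stats` and the durations are hypothetical.

```python
import pandas as pd

def blink_stats(durations):
    """Summary statistics for the blink durations observed in one window."""
    s = pd.Series(durations, dtype=float)
    return {
        'n': len(s),
        'mean': s.mean(),
        'sd': s.std(),        # NaN with fewer than 2 values
        'skew': s.skew(),     # NaN with fewer than 3 values
        'kurtosis': s.kurt(), # NaN with fewer than 4 values
    }

no_blinks = blink_stats([])
one_blink = blink_stats([92.0])
two_blinks = blink_stats([92.0, 150.0])
three_blinks = blink_stats([92.0, 150.0, 183.0])
four_blinks = blink_stats([92.0, 150.0, 183.0, 250.0])
```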

In [99]:
# count the trials with a fractional number of blinks, strictly between 0 and 1; expect the 16 unaccounted for above
len( df_features[(df_features.number_of_blinks > 0) & (df_features.number_of_blinks < 1)] )

16

In [100]:
df_features[(df_features.number_of_blinks > 0) & (df_features.number_of_blinks < 1)]

Unnamed: 0,fixation_duration_median,fixation_duration_mean,fixation_duration_standard_deviation,fixation_duration_minimum,fixation_duration_maximum,fixation_duration_range,fixation_duration_skew,fixation_duration_kurtosis,saccade_duration_median,saccade_duration_mean,saccade_duration_standard_deviation,saccade_duration_minimum,saccade_duration_maximum,saccade_duration_range,saccade_duration_skew,saccade_duration_kurtosis,saccade_amplitude_median,saccade_amplitude_mean,saccade_amplitude_standard_deviation,saccade_amplitude_minimum,saccade_amplitude_maximum,saccade_amplitude_range,saccade_amplitude_skew,saccade_amplitude_kurtosis,saccade_velocity_median,saccade_velocity_mean,saccade_velocity_sd,saccade_velocity_min,saccade_velocity_max,saccade_velocity_range,saccade_velocity_skew,saccade_velocity_kurtosis,saccade_angle_absolute_median,saccade_angle_absolute_mean,saccade_angle_absolute_standard_deviation,saccade_angle_absolute_minimum,saccade_angle_absolute_maximum,saccade_angle_absolute_range,saccade_angle_absolute_skew,saccade_angle_absolute_kurtosis,saccade_angle_relative_median,saccade_angle_relative_mean,saccade_angle_relative_standard_deviation,saccade_angle_relative_minimum,saccade_angle_relative_maximum,saccade_angle_relative_range,saccade_angle_relative_skew,saccade_angle_relative_kurtosis,pupil_diameter_median,pupil_diameter_mean,pupil_diameter_standard_deviation,pupil_diameter_minimum,pupil_diameter_maximum,pupil_diameter_range,pupil_diameter_skew,pupil_diameter_kurtosis,number_of_blinks,blink_duration_median,blink_duration_mean,blink_duration_standard_deviation,blink_duration_minimum,blink_duration_maximum,blink_duration_range,blink_duration_skew,blink_duration_kurtosis
299,192.0,235.857143,112.711178,134.0,450.0,316.0,1.414997,1.331512,25.0,27.666667,16.548917,8.0,50.0,42.0,0.253358,-1.836772,206.791453,255.353218,187.438601,103.12697,626.762008,523.635038,2.122674,4.898237,10.876436,10.133386,3.985703,4.351276,14.922905,10.571629,-0.415007,-1.16346,161.575534,146.25399,131.281609,4.907445,359.741422,354.833976,0.587906,0.405956,29.015632,151.567946,188.192147,3.766443,359.891629,356.125186,0.597983,-3.315961,0.782655,0.48459,0.653546,-0.757375,1.265378,2.022753,-0.667711,-1.258244,0.939146,92.0,92.0,,92.0,92.0,0.0,,
300,258.0,281.833333,145.641424,108.0,634.0,526.0,1.237721,2.091994,17.0,44.727273,72.252461,8.0,258.0,250.0,3.07682,9.795007,153.154364,276.965682,278.378396,85.843423,888.406662,802.563238,1.718866,1.715903,10.480121,10.163131,4.534333,0.418298,17.768133,17.349836,-0.566316,1.525239,179.410146,194.846978,145.079486,0.937055,357.631544,356.694489,-0.165338,-1.495254,8.941555,108.769773,166.385165,0.441142,357.99738,357.556238,1.038308,-1.200892,0.486177,0.417752,0.392569,-0.380058,1.127071,1.507129,-0.070422,-1.304915,0.939146,183.0,183.0,,183.0,183.0,0.0,,
308,200.0,252.615385,119.158535,125.0,508.0,383.0,1.216777,0.646134,25.0,40.916667,46.982508,8.0,183.0,175.0,2.927104,9.227718,231.084197,350.605657,285.406263,60.566519,937.533993,876.967474,1.533368,1.300929,10.598379,10.568869,4.387642,1.605081,18.11346,16.50838,-0.22206,0.671801,90.42628,120.522755,137.821174,1.094117,358.768485,357.674368,0.752012,-0.710893,354.270411,196.534492,184.452496,0.789281,359.634369,358.845088,-0.213049,-2.442981,0.072995,0.056513,0.577166,-1.004075,1.053539,2.057614,-0.042325,-1.148277,0.939146,117.0,117.0,,117.0,117.0,0.0,,
1156,283.0,258.909091,99.613708,83.0,400.0,317.0,-0.906977,0.300253,17.0,26.6,16.153431,16.0,67.0,51.0,2.003597,4.406114,137.051626,193.139406,229.614671,50.865256,833.256071,782.390815,2.92411,8.913628,5.180167,6.360732,2.992768,2.992074,12.436658,9.444584,1.3407,0.958969,178.316108,180.568347,163.741286,3.000575,358.488266,355.487691,0.011678,-2.125688,10.503548,122.61395,173.382656,0.172731,358.010913,357.838182,0.853828,-1.710181,0.284173,0.290219,0.122095,0.0556,0.757994,0.702393,0.784143,2.028136,0.603595,116.0,116.0,,116.0,116.0,0.0,,
1439,200.0,209.0,68.582068,100.0,316.0,216.0,0.116264,-0.596039,34.0,252.0,493.269558,16.0,1282.135421,1433.0,2.628378,7.065498,224.987181,284.675023,198.791207,70.732469,652.192573,581.460105,0.917656,0.044695,4.662241,5.189283,3.609404,0.131522,11.608158,11.476635,0.427366,0.31397,177.255119,213.994915,127.169767,1.931986,354.820849,352.888863,-0.227617,-0.76682,323.503455,206.444568,173.990607,7.151325,353.181427,346.030102,-0.386078,-2.697739,0.506828,0.44665,0.317483,-0.163821,0.993146,1.156967,-0.228452,-1.242003,0.89996,250.0,250.0,,250.0,250.0,0.0,,
1442,234.0,290.555556,137.935774,166.0,583.0,417.0,1.425428,1.543285,50.0,149.875,165.443506,16.0,450.0,434.0,1.019965,-0.445093,224.273383,274.549883,227.276597,14.090214,704.015098,689.924885,1.03576,0.564485,4.204737,5.013769,4.523981,0.046967,12.272532,12.225564,0.531349,-1.229756,242.339618,215.435528,145.864558,2.518068,354.074971,351.556903,-0.422014,-1.757154,43.773348,160.164036,161.556693,2.194157,354.021633,351.827476,0.367845,-2.666477,-0.596698,-0.652345,0.266629,-1.220114,0.053912,1.274026,-0.079837,-0.158776,0.89996,249.0,249.0,,249.0,249.0,0.0,,
1464,316.0,294.111111,76.577484,199.0,417.0,218.0,0.231473,-0.941048,33.0,56.375,73.667084,16.0,234.0,218.0,2.563664,6.785645,181.360382,247.546248,228.317952,113.964609,803.437354,689.472745,2.647577,7.244179,6.628632,7.047762,3.720147,0.653545,12.051965,11.39842,-0.109197,0.3446,354.231258,243.888782,163.047817,0.073718,359.784916,359.711198,-0.977268,-1.10146,354.426618,206.527101,188.342647,1.031631,358.935669,357.904038,-0.374476,-2.797662,0.940114,1.027221,0.377084,0.191554,2.079466,1.887912,0.954792,0.278843,0.89996,217.0,217.0,,217.0,217.0,0.0,,
1877,358.0,402.0,219.093587,200.0,850.0,650.0,1.366692,1.70527,83.0,64.0,34.229617,16.0,100.0,84.0,-0.888996,-1.076983,155.106824,143.004946,46.156453,82.819112,217.223455,134.404343,0.227157,-0.304731,2.172235,3.527126,3.15308,0.985942,9.694176,8.708235,1.610206,1.905347,167.149917,177.613418,173.269116,0.369634,356.076167,355.706533,0.041563,-2.596677,10.882094,118.819232,170.239955,5.081833,341.067646,335.985813,0.967942,-1.872728,-0.145567,-0.099077,0.29924,-0.729208,0.842225,1.571433,0.30378,-0.415048,0.475401,166.0,166.0,,166.0,166.0,0.0,,
2517,179.0,219.333333,130.440048,109.0,591.0,482.0,2.423134,6.582892,25.0,109.909091,185.829198,9.0,558.0,549.0,2.022062,3.076794,172.535407,215.409135,152.369901,68.83099,527.550658,458.719669,1.110517,0.109825,4.570244,6.088728,3.932975,0.720234,12.343413,11.623179,0.250726,-1.257574,173.864445,173.158023,139.551066,0.281678,355.223335,354.941657,0.12037,-1.592505,34.390921,144.279849,165.248985,5.165195,357.624591,352.459396,0.498584,-2.177765,-1.74489,-1.801091,0.559553,-3.203975,-0.509991,2.693984,0.117552,0.111294,0.788684,216.0,216.0,,216.0,216.0,0.0,,
3723,187.5,261.375,175.708312,92.0,533.0,441.0,1.002319,-0.696505,92.0,227.428571,253.462592,8.0,608.0,600.0,0.562366,-1.851151,122.816055,180.446728,126.134177,43.228219,349.560613,306.332394,0.311873,-2.226494,1.334957,4.626232,6.080809,0.22755,16.885719,16.658169,1.690409,2.681074,174.083168,156.449282,146.118298,1.321254,359.333776,358.012523,0.356835,-1.556719,285.257384,209.404589,161.590711,0.36872,354.749391,354.380671,-0.845495,-1.862574,0.861825,0.93118,0.317956,0.254279,2.209315,1.955035,0.262691,0.507401,0.844047,391.0,391.0,,391.0,391.0,0.0,,


At this point I am going to back off from our original plan.  Let's restore all of these columns back to the temporary df.
Since there are so many trials with only 0 or 1 blink, and thus all the missing values, let's pull in only the number of
blinks and the blink duration mean.  It is reasonable to use 0 for the blink duration mean when there are 0 blinks, which I assume
is probably what paper 2 did here.  So we will just pull over these two features, and fill in missing blink duration means as 0.
We will also fix the 16 trials with a fractional number of blinks, and set them all to 1 as that appears to be the correct
value.  We can then make the number_of_blinks feature an integer value.
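
The two cleaning steps just described can be sketched on a toy data frame as follows.  The values are made up for illustration.  One detail worth noting is that the assignment selects the number_of_blinks column explicitly with `.loc`, because assigning through a bare boolean row mask would overwrite every column in the matching rows.

```python
import numpy as np
import pandas as pd

# toy stand-in for df_features; values are made up for illustration
toy = pd.DataFrame({
    'fixation_duration_mean': [244.0, 231.5, 260.2, 250.0],
    'number_of_blinks': [0.0, 0.939146, 2.0, 1.0],
    'blink_duration_mean': [np.nan, 92.0, 150.0, 183.0],
})

# select only the fractional rows AND only the number_of_blinks column
fractional = (toy.number_of_blinks > 0) & (toy.number_of_blinks < 1)
toy.loc[fractional, 'number_of_blinks'] = 1.0
toy['number_of_blinks'] = toy['number_of_blinks'].astype(int)

# a trial with no blinks has no blink duration; record it as 0
toy['blink_duration_mean'] = toy['blink_duration_mean'].fillna(0.0)
```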

In [101]:
# start by backing off processing the number of blinks and blink duration features. 
# put them back into the temporary dataframe and remove them from the df_features
# should end up with the 9 features back into df, for a total again of 59
df = df.join(df_features[blink_features_map.values()].copy())
df.shape

(4076, 59)

In [102]:
# and drop these features out of df_features, bringing us back to the
# 7x8 = 56 features before we started looking at number of blinks
df_features.drop(blink_features_map.values(), axis=1, inplace=True)
df_features.shape

(4076, 56)

In [103]:
# now lets only add in the number_of_blinks and blink_duration_mean to df_features
blink_features = ['number_of_blinks', 'blink_duration_mean']
df_features = df_features.join(df[blink_features].copy())
df_features.shape

(4076, 58)

In [104]:
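# make all number of blinks between 0 and 1 have a value of 1.  select only the number_of_blinks column,
# otherwise every column in these rows would be overwritten with 1.0
#sum( (df_features.number_of_blinks > 0) & (df_features.number_of_blinks < 1) )
df_features.loc[ (df_features.number_of_blinks > 0) & (df_features.number_of_blinks < 1), 'number_of_blinks' ] = 1.0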

# and make it into an int value
df_features.number_of_blinks = df_features.number_of_blinks.astype(int)

print(df_features.number_of_blinks.dtype)
print( len( df_features[ (df_features.number_of_blinks > 0) & (df_features.number_of_blinks < 1) ] ) )

int64
0


In [105]:
# and lets remove those two features we have now processed from the temporary data frame
df.drop(blink_features, axis=1, inplace=True)
df.shape

(4076, 57)

At this point we have processed 58 features mentioned in paper 2, and have the 4th group, the miscellaneous gaze properties, still to look at.

Four measurements are listed for miscellaneous gaze properties: number of saccades, horizontal saccade proportion, fixation dispersion
and fixation duration / saccade duration ratio.

Looking at the column names remaining, we can probably use any of SacDurN, SacVelN, etc. for the number of saccades.  It is a good idea to check that
these are all the same / redundant in the original data, just to make sure.

The other measurements seem to be named FxDisp, horizontalSaccadeProp and FxSacRatio respectively.  I had mentioned above that I was unsure what
FxDisp might be; it sits next to the group of FixDur measurements, and it is the fixation dispersion.

Let's pull out these 4 features and check for missing values.  This will make our total 62, just as mentioned in paper 2.

First let's confirm that the N measurements for saccades are all the same, just to make sure we are understanding the data properly.

In [106]:
# we have not yet renamed these from the original column names.  But let's
# first just check that they all seem to record the same number of saccades for each
# trial
saccade_feature_list = ['SacDurN', 'SacAmpN', 'SacVelN', 'SacAngRelN', 'SacAngAbsN']

df[saccade_feature_list]

Unnamed: 0,SacDurN,SacAmpN,SacVelN,SacAngRelN,SacAngAbsN
1,10.0,10.0,10.0,9.0,10.0
2,10.0,10.0,10.0,9.0,10.0
3,12.0,12.0,12.0,11.0,12.0
4,13.0,13.0,13.0,12.0,13.0
5,11.0,11.0,11.0,10.0,11.0
...,...,...,...,...,...
4072,14.0,14.0,14.0,13.0,14.0
4073,15.0,15.0,15.0,14.0,15.0
4074,13.0,13.0,13.0,12.0,13.0
4075,16.0,16.0,16.0,15.0,16.0


So we can see from a quick look that not all saccade counts are equal.  This most likely means that some measurements cannot be calculated for every saccade.  In
the rows shown, the duration, amplitude, velocity and absolute angle counts all agree, and only the relative angle count is one lower.  That would make sense
if a relative angle is computed between consecutive saccades, though that is a guess.  If true, we can use any of the first 3 N values for number_of_saccades.

Let's test that the first 3 are all redundant.

In [107]:
# try again but only first 3 measurements
saccade_feature_list = ['SacDurN', 'SacAmpN', 'SacVelN']

#df[saccade_feature_list]

# count up number of trials where all 3 are equal.
sum( (df.SacDurN == df.SacAmpN) & (df.SacAmpN == df.SacVelN) )

4064

Close, but there are 12 rows where the values of these 3 differ.  Let's examine those 12 rows.

In [108]:
df[(df.SacDurN != df.SacAmpN) | (df.SacAmpN != df.SacVelN) | (df.SacDurN != df.SacVelN)].loc[:,saccade_feature_list]

Unnamed: 0,SacDurN,SacAmpN,SacVelN
61,7.0,7.598867,7.0
512,5.0,6.256247,5.0
1003,5.0,7.636825,5.0
1633,8.0,8.463177,8.0
2309,17.0,16.482372,17.0
2456,7.0,7.742345,7.0
2745,5.0,5.535021,5.0
3121,6.0,6.300095,6.0
3246,8.0,8.455771,8.0
3489,7.0,7.216581,7.0


The problem is in the saccade amplitude count, which has some fractional values; a count should always be a whole number.  However those got there, it appears
that the true value of number_of_saccades is probably either the saccade duration N or the saccade velocity N.
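
Rows like these could also be located directly, rather than by pairwise comparison, by testing each count column for non-whole values.  The following is a
minimal sketch on a small made-up frame; the values are illustrative and not taken from the dataset, and `toy`, `count_cols`, `fractional` and `disagree`
are names introduced here only for illustration.

```python
import pandas as pd

# toy data standing in for the real df; values are made up for illustration
toy = pd.DataFrame({
    'SacDurN': [7.0, 5.0, 10.0],
    'SacAmpN': [7.5, 5.0, 10.0],
    'SacVelN': [7.0, 5.0, 10.0],
})
count_cols = ['SacDurN', 'SacAmpN', 'SacVelN']

# True wherever a supposed count has a fractional part
fractional = toy[count_cols].mod(1).ne(0)

# number of non-whole values in each column
fractional_per_column = fractional.sum()
print(fractional_per_column)

# rows where the three counts do not all agree
disagree = toy[count_cols].nunique(axis=1) > 1
print(toy[disagree])
```

On the real data, a column whose fractional count is zero would be a safe candidate for conversion to an int type.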

Let's just use SacDurN for this measure, and convert it to an int in our features.  We will pull out the other 3 identified features along with it,
then as usual look for missing values and see if we need to clean further.

These 4 new features turn out to be clean, with no missing values (see the summary below).  And because SacDurN already holds whole-number saccade counts,
we can convert it directly to an int type.

In [109]:
miscellaneous_features_map = {
    'SacDurN':               'number_of_saccades',
    'horizontalSaccadeProp': 'horizontal_saccade_proportion',
    'FxDisp':                'fixation_dispersion',
    'FxSacRatio':            'fixation_saccade_durtion_ratio',
}

In [110]:
# rename these columns in temporary dataframe
df.rename(columns = miscellaneous_features_map, inplace=True)

In [111]:
# append these feature columns to our current df_features
df_features = df_features.join(df[miscellaneous_features_map.values()].copy())
df_features.shape

(4076, 62)

In [112]:
# now remove these features from the temporary data frame, to make it easier to see what is left over after extracting mentioned
# features from paper.
df.drop(miscellaneous_features_map.values(), axis=1, inplace=True)
df.shape

(4076, 53)

In [113]:
# look at the properties of the miscellaneous features
df_features[miscellaneous_features_map.values()].describe()

Unnamed: 0,number_of_saccades,horizontal_saccade_proportion,fixation_dispersion,fixation_saccade_durtion_ratio
count,4076.0,4076.0,4076.0,4076.0
mean,11.165849,0.964906,0.477894,6.593431
std,2.901655,0.076675,0.071925,4.33707
min,5.0,0.375,0.268,0.328
25%,9.0,1.0,0.425,3.32275
50%,11.0,1.0,0.477,5.8575
75%,13.0,1.0,0.528,8.92575
max,20.0,1.0,0.736342,61.776


In [114]:
# clean number of saccades, making values into integers since they should be whole numbers here
df_features.number_of_saccades = df_features.number_of_saccades.astype(int)
df_features.number_of_saccades.dtype

dtype('int64')

# Exploration of Unused Features

At this point we have completed a df_features data frame which seems to have exactly the same 62 measured features described in paper 2.
No missing data remains; we only had to fix a few features for missing data or other bad values.

We also have a df_label dataframe into which we extracted features that need to be the labels or targets of any classifiers we build, including
the new mind_wandering_label feature which we will use for binary classification models.

Before we move on, let's look at the remaining features that seem not to have been used in paper 2.  We have 53 features remaining, though the first 6
are metadata.  Let's create another data frame called df_experiment_metadata to hold the official metadata information in case we need it, and remove these columns so we
can concentrate on the remaining 47 features.

In [115]:
# first 6 columns are the experiment metadata features, extract to a new dataframe called df_experiment_metadata and remove from the temporary
# working df
experiment_metadata_features = df.columns[:6]

# make a copy of these columns and create new dataframe
df_experiment_metadata = df[experiment_metadata_features].copy()
print(df_experiment_metadata.shape)
df_experiment_metadata.head()

(4076, 6)


Unnamed: 0,participant_id,participant_location,segment_index,start_time,end_time,trial_length
1,1002-UM,UM,57,2013-10-18 18:34:54.808,2013-10-18 18:35:22.271,27463
2,1002-UM,UM,56,2013-10-18 18:34:34.590,2013-10-18 18:34:54.808,20218
3,1002-UM,UM,55,2013-10-18 18:34:11.125,2013-10-18 18:34:34.590,23465
4,1002-UM,UM,54,2013-10-18 18:33:47.726,2013-10-18 18:34:11.125,23399
5,1002-UM,UM,53,2013-10-18 18:33:27.674,2013-10-18 18:33:47.726,20052


In [116]:
# remove those fields from the temporary df 
df.drop(experiment_metadata_features, axis=1, inplace=True)
df.shape

(4076, 47)

Let's list these feature column names again.  We know the purpose of some of these already.  The measureN columns are simply the number of measurements behind
some of the eye tracking statistics that were calculated.  Those are not going to be needed.  Likewise for the blink_duration statistics we already
discussed above.  For many trials the number of blinks was 0, so some of these statistics are meaningless, and when there is 1 blink, mean and median make sense, but
not standard deviation and maybe other statistics.  It is not likely these will be useful.

Let's remove all of these features and look at the remaining ones.
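
To illustrate why some blink duration statistics are undefined for trials with 0 or 1 blinks, here is a small sketch using hypothetical blink durations
(not values from the dataset).  It assumes pandas defaults, where `Series.std()` computes the sample standard deviation (ddof=1), so a single observation
gives NaN.

```python
import pandas as pd

# hypothetical blink durations (ms) for trials with 0, 1 and 3 blinks
no_blinks = pd.Series([], dtype=float)
one_blink = pd.Series([120.0])
three_blinks = pd.Series([100.0, 120.0, 140.0])

for name, durations in [('0 blinks', no_blinks),
                        ('1 blink', one_blink),
                        ('3 blinks', three_blinks)]:
    # mean and median exist once there is at least one blink;
    # the sample standard deviation needs at least two
    print(name, durations.mean(), durations.median(), durations.std())
```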


In [117]:
for idx, col in enumerate(df.columns):
    print('%02d: %s' % (idx, col))

00: ValidityRate
01: PageFixations
02: WindowFixations
03: PageBlinks
04: WindowBlinks
05: BottomWindowBound_Page
06: TopWindowBound_Page
07: BottomWindowBound_Session
08: TopWindowBound_Session
09: FixDurN
10: SacAmpN
11: SacAngAbsN
12: SacAngRelN
13: SacVelN
14: PupilDiametersZN
15: FirstPassFixDurMean
16: FirstPassFixDurSD
17: FirstPassFixProp
18: EndOfClauseFixDurMean
19: EndOfClauseFixDurSD
20: EndOfClauseFixProp
21: RegFixDurMean
22: RegFixDurSD
23: RegFixProp
24: SingleFixDurMean
25: SingleFixDurSD
26: SingleFixProp
27: NoWordFixDurMean
28: NoWordFixDurSD
29: NoWordFixProp
30: GazeFixDurMean
31: GazeFixDurSD
32: GazeFixProp
33: WordSkipProp
34: propCrossLineSaccades
35: readingDepth
36: WordLenToFixDurCorr
37: FreqToFixDurCorr
38: NumSynsToFixDurCorr
39: HypDepthToFixDurCorr
40: blink_duration_median
41: blink_duration_standard_deviation
42: blink_duration_minimum
43: blink_duration_maximum
44: blink_duration_range
45: blink_duration_skew
46: blink_duration_kurtosis


In [118]:
explained_features = ['FixDurN', 'SacAmpN', 'SacAngAbsN', 'SacAngRelN', 'SacVelN', 'PupilDiametersZN', 
                      'blink_duration_median', 'blink_duration_standard_deviation', 
                      'blink_duration_minimum', 'blink_duration_maximum', 'blink_duration_range', 
                      'blink_duration_skew', 'blink_duration_kurtosis']

# remove these explained / uninteresting features from the data now
df.drop(explained_features, axis=1, inplace=True)
df.shape

(4076, 34)

This leaves us with the following 34 features.  For some of these we can speculate on their meaning and possible use.  We could also ask the
original data generators about them if we think they might be of interest.

There are plenty of other eye tracking measurements.  Many of them appear to be measures of fixations under various conditions.  I would guess that
window fixations and window blinks are counts of the fixations and blinks that occurred on the whole screen, or maybe on the open window of the experiment.
Experiments are rarely run in a GUI windowing environment cluttered with other windows; usually the experiment is run full screen.
So I think the window measures probably just cover the time spent looking at the screen.

The page fixations and blinks may measure those counts only when looking at the page they are reading.  Presumably the experiment did have extra information 
other than the page of text to be read, so the page of text was a sub area of the total screen.

Others, like the end of clause, reg(?), single and no word fixations, are all measures of fixations for some special condition or particular
area of the screen during the experiment.  Most of these do not have the full 8 statistics, usually only mean, standard deviation and
proportion(?), and they lack a count of the number of fixations of that type.  I assume, since there are means and standard deviations, that these are
statistics measured over the fixations of some subarea within the trial window.  I would also assume that the full 8 statistics could be calculated for most of
these if wanted, but for this experiment they had decided not to use them.

That would account for most of this set of 34.  There are a few that are not fixation or blink measures.  These include:
ValidityRate, BottomWindowBound_Page, TopWindowBound_Page, BottomWindowBound_Session, TopWindowBound_Session,
WordSkipProp, readingDepth

The top and bottom bounds are likely measures of positions of items on screen during experiment.  They may be needed to calculate
for example bounding boxes or areas to calculate some of the other measures with.

ValidityRate, WordSkipProp and readingDepth might all be interesting features to use.  I am not sure what the validity rate might be here.
WordSkipProp might be a measure of the proportion of words skipped over by fixations while reading.  I am not sure about readingDepth either; could they
have taken some measure of the subjects' individual differences, making this a measure of reading ability?  If so it is the only measure of
individual ability that I seem to see in the data.
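
One way to test the guess that readingDepth is a per-participant trait would be to check whether it is constant within each participant.  The sketch
below uses made-up values and assumes the participant ids have been joined back from df_experiment_metadata; `toy`, `distinct_per_participant` and
`is_constant_within_participant` are illustrative names only.

```python
import pandas as pd

# made-up values; in the notebook, participant_id would come from
# df_experiment_metadata and readingDepth from the temporary df
toy = pd.DataFrame({
    'participant_id': ['A', 'A', 'B', 'B'],
    'readingDepth':   [0.8, 0.8, 0.5, 0.6],
})

# number of distinct readingDepth values seen for each participant
distinct_per_participant = toy.groupby('participant_id')['readingDepth'].nunique()

# True only if every participant has a single value across all trials
is_constant_within_participant = bool((distinct_per_participant == 1).all())

print(distinct_per_participant)
print(is_constant_within_participant)
```

A single value per participant in the real data would support the individual-difference reading; values that change from page to page would suggest a
per-trial measure instead.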

In [119]:
for idx, col in enumerate(df.columns):
    print('%02d: %s' % (idx, col))

00: ValidityRate
01: PageFixations
02: WindowFixations
03: PageBlinks
04: WindowBlinks
05: BottomWindowBound_Page
06: TopWindowBound_Page
07: BottomWindowBound_Session
08: TopWindowBound_Session
09: FirstPassFixDurMean
10: FirstPassFixDurSD
11: FirstPassFixProp
12: EndOfClauseFixDurMean
13: EndOfClauseFixDurSD
14: EndOfClauseFixProp
15: RegFixDurMean
16: RegFixDurSD
17: RegFixProp
18: SingleFixDurMean
19: SingleFixDurSD
20: SingleFixProp
21: NoWordFixDurMean
22: NoWordFixDurSD
23: NoWordFixProp
24: GazeFixDurMean
25: GazeFixDurSD
26: GazeFixProp
27: WordSkipProp
28: propCrossLineSaccades
29: readingDepth
30: WordLenToFixDurCorr
31: FreqToFixDurCorr
32: NumSynsToFixDurCorr
33: HypDepthToFixDurCorr


# Conclusion

The goal of this notebook was to explore and clean the raw mind wandering dataset we have.  In particular, we targeted finding and
cleaning the features used in the paper 2 reference article, which used this same data for its reported results.

From exploration, we appear to have 4076 clean trials in this data, performed by 135 unique participants.  As mentioned
above, paper 2 indicates 132 participants in the work reported, so there may be a slight discrepancy there.  The number of trials per
participant ranged from a low of 4 to a full 57 (as reported in the article, each subject was given 57 pages of text to read).
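
For reference, counts like these can be derived from the metadata with a single groupby.  This is a sketch on made-up rows standing in for
df_experiment_metadata; the second participant id is hypothetical, and `toy_metadata` and `trials_per_participant` are names introduced here.

```python
import pandas as pd

# made-up metadata rows standing in for df_experiment_metadata
toy_metadata = pd.DataFrame({
    'participant_id': ['1002-UM', '1002-UM', '1002-UM', '1003-UM'],
    'segment_index':  [57, 56, 55, 12],
})

# one entry per participant, holding that participant's number of trials
trials_per_participant = toy_metadata.groupby('participant_id').size()

print(len(trials_per_participant))     # number of unique participants
print(trials_per_participant.min())    # fewest trials for any participant
print(trials_per_participant.max())    # most trials for any participant
```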

The 62 features used in paper 2 appear to be present.  We are confident from this notebook that the 62 features in the resulting df_features
dataframe should be pretty similar to the set used for the work reported in paper 2.  The measurements needed little cleaning; most
had no missing values and seemed reasonable.  Only one or two features needed filling, for example blink_duration_mean was filled with 0
whenever there were 0 blinks during a recorded trial.  Also of note, all features are numeric in this data set.  Some are
whole numbers, like the saccade counts and blink counts.  But no categorical variables are present in this data.

The following data frames were extracted that have cleaned data that may be of use for modeling or other activities:

- **df_experiment_metadata**: Dataframe of the experiment metadata fields, including participant id, location, and times and dates
  when each trial began and ended.  This information should not be needed for any model training, but we may need the start
  and end times, or most likely the segment index, if we want to stitch trials together into a time series of consecutive
  trials for an RNN input.
- **df_label**: Dataframe of columns / fields that are really results or labels that we might want to create a classifier for.
  We derived a mind_wandered_label binary category in this dataframe from the number of mind wandering reports fields
  and the report type field.  The other fields are also label information: they should be used as targets for supervised learning, not as inputs
  to any model.
- **df_features**: Specifically we ended up extracting the 62 features that appear to have been used in paper 2 into this dataframe.
  The data values have not been normalized yet, and little cleaning or filling in of missing data was needed.
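
As a rough illustration of the stitching idea mentioned for df_experiment_metadata, the sketch below orders trials by segment index within each
participant and collects one feature array per participant.  It runs on made-up data and assumes the metadata and feature frames share the same row index
(as they should here, since both were copied from the same df); `toy_metadata`, `toy_features` and `sequences` are illustrative names.

```python
import pandas as pd

# made-up stand-ins for df_experiment_metadata and df_features,
# sharing the same row index
toy_metadata = pd.DataFrame({
    'participant_id': ['P1', 'P1', 'P1', 'P2', 'P2'],
    'segment_index':  [3, 1, 2, 2, 1],
})
toy_features = pd.DataFrame({'number_of_saccades': [12, 10, 11, 9, 8]},
                            index=toy_metadata.index)

# order trials within each participant by segment_index
ordered = toy_metadata.sort_values(['participant_id', 'segment_index'])

# one array of shape (n_trials, n_features) per participant, in reading order
sequences = {
    pid: toy_features.loc[rows.index].to_numpy()
    for pid, rows in ordered.groupby('participant_id', sort=False)
}

for pid, seq in sequences.items():
    print(pid, seq.shape)
```

Since participants have different numbers of trials, sequences built this way would still need padding or masking before being batched for an RNN.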

As a next step, we should extract the cleaning steps for the cleaned dataframes above into scikit-learn pipelines, and maybe put these into
script files for easy reuse.  Also in this next step, we should create additional pipelines to normalize the range of the data for
machine learning algorithms that are sensitive to magnitude differences between features.
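
A normalization pipeline of that kind might look like the following minimal sketch.  It assumes scikit-learn is installed and picks StandardScaler as
one possible choice (MinMaxScaler would be another); the toy values and the names `toy_features` and `scaling_pipeline` are illustrative only.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# made-up values standing in for two columns of df_features
toy_features = pd.DataFrame({
    'number_of_saccades':  [9, 11, 13],
    'fixation_dispersion': [0.425, 0.477, 0.528],
})

# a single-step pipeline; cleaning steps could be added in front of the scaler
scaling_pipeline = Pipeline([
    ('standardize', StandardScaler()),
])

# each column is rescaled to zero mean and unit variance
scaled = scaling_pipeline.fit_transform(toy_features)
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

The scaler should be fit on training data only and then applied to held-out data with `transform`, so that no information leaks from the test trials.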