# iCatcher annotations --> looking time pipeline

This script and the helper classes it calls are based on Raz et al., 2024 (https://osf.io/ndkt6/). It organizes iCatcher annotations and other subject- and trial-level data into a format that can feed into our R looking time pipeline. It does this by arranging subject-, trial-, and look-level data in the same format that Datavyu manual coding usually outputs.

## Inputs: 

### 1. Lookit study-level JSON file: 

We parse the Lookit study-wide .json log (stored at /data/metadata/lookit_study.json) to obtain trial onsets/offsets, trial data, subject data, and session data in one pass.
    
### 2. iCatcher annotation file/s (one .npz file per trial, per subject, per session)

These are the main outputs from running iCatcher. The pipeline expects one file per trial, per subject, per session, named `[CHILD_ID]/[TRIAL_ID]_[CHILD_ID].npz`.
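
As an illustration of the naming convention, the sketch below splits a relative annotation path into a child ID and a trial ID. The helper name `parse_annotation_name` is hypothetical and not part of this pipeline; it assumes paths use forward slashes and that the file stem ends in `_[CHILD_ID]`.

```python
from pathlib import PurePosixPath


def parse_annotation_name(rel_path):
    """Split '[CHILD_ID]/[TRIAL_ID]_[CHILD_ID].npz' into (child_id, trial_id).

    Returns None when the path does not follow the expected convention.
    """
    p = PurePosixPath(rel_path)
    child_id = p.parent.name          # folder name, e.g. 'CKRG26'
    stem = p.stem                     # e.g. 'easy-snail-cow_CKRG26'
    suffix = "_" + child_id
    if p.suffix != ".npz" or not child_id or not stem.endswith(suffix):
        return None
    return child_id, stem[: -len(suffix)]


print(parse_annotation_name("CKRG26/easy-snail-cow_CKRG26.npz"))
# ('CKRG26', 'easy-snail-cow')
```

Stripping the `_[CHILD_ID]` suffix, rather than splitting on the first underscore, would keep working if a trial ID ever contained an underscore.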

### 3. Raw video files (or video-parsed .json files)

These are needed to extract frame timing, which is used to convert frame indices to ms (to get look-event onsets/offsets relative to the start of the video).

If this conversion has already been run, .json files with the relevant information should exist in the videos directory. If not, these .json files will be written in the process of obtaining this info.
   
Since we have a separate video for each trial, we do not need to calculate the exact trial onset time. However, we do store the lag between the onset of the audio and the onset of the video recording on each trial, to use in our main looking time analysis in R. The original Raz et al., 2024 code (https://osf.io/ndkt6/) provides detailed instructions on how to deal with trial onsets relative to the start of videos, both when a manual onset CSV file is required and when it is not. The original code also provides flexibility with subject-level and experiment onset information.
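
To make the frame-to-ms conversion concrete, here is a minimal sketch that assumes a constant frame rate. This is a simplification: the pipeline itself takes per-frame time stamps from the imported `get_frame_information`, whose implementation is not shown here, and frame rates can vary after mp4 conversion. The function name `frame_times_ms` is hypothetical.

```python
import numpy as np


def frame_times_ms(n_frames, fps):
    """Onset of each frame in ms from video start, assuming a constant frame rate."""
    return np.floor(np.arange(n_frames) * 1000.0 / fps).astype(int)


stamps = frame_times_ms(n_frames=5, fps=25.0)
print(stamps.tolist())  # [0, 40, 80, 120, 160]
```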
    

In [116]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## import libraries 

In [117]:
import os
import os.path as op
from pathlib import Path
import sys
import pandas as pd
import numpy as np

module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from helperfuncs.video_framerates import get_frame_information
from helperfuncs.lookit_json_parser import get_lookit_trial_times

from config import *

## Set relevant paths

In [124]:
raw_data_dir = op.join(SERVER_PATH, 'data', 'raw')
data_dir = op.join(PROJECT_PATH, 'data', PROJECT_VERSION)
metadata_dir = op.join(PROJECT_PATH, 'data', 'metadata')

# where the raw iCatcher outputs are stored
icatcher_outputs_dir = op.join(raw_data_dir, 'icatcher_annotations')
# where the raw videos are stored
videos_dir = op.join(raw_data_dir, 'original_videos', 'mp4') 

# where the raw Lookit JSON is stored - this is the only raw data file that is stored locally
lookit_json = op.join(PROJECT_PATH, 'data', 'raw', 'lookit','lookit_study.json')

# where the trial timing info, for this specific project version, processed from the raw Lookit JSON is stored
lookit_trial_info_csv = op.join(data_dir, 'data_to_analyze', 'lookit_trial_timing_info.csv')


### Create all necessary functions

#### get all non-hidden files in dir (helper function):

In [125]:
# recursively list all .npz files under path, skipping hidden files (those beginning with '.')
def listdir_nohidden(path):
    for root, dirs, files in os.walk(path):  # Walk through each directory and subdirectory
        for f in files:
            if not f.startswith('.') and f.endswith('.npz'):  # Check conditions
                yield os.path.relpath(os.path.join(root, f), start=path) 

## Set parameters for manual error exclusions


In [126]:
include_manual_edits = 0

if include_manual_edits:
    manually_coded_sections = pd.read_csv("manual_coding_timestamps.csv")

#### convert iCatcher annotated look-events from frame-wise to timing (ms from video onset) 

In [127]:
def subsample_data(data, target_length):
    """
    Subsample the input data to match the target length.
    
    Parameters:
        data (array-like): The input data to subsample.
        target_length (int): The desired length of the output data.
    
    Returns:
        array-like: Subsampled data.
    """
    data = np.asarray(data)  # allow plain lists (e.g. frame time stamps) to be indexed by an array
    factor = len(data) / target_length
    indices = np.round(np.arange(0, len(data), factor)).astype(int)[:target_length]
    return data[indices]

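def read_convert_output(filename, stamps, manual_edits_df=None):
    """
    Given an npz file containing iCatcher annotated frames and looks,
    converts to pandas DataFrames, with a column mapping each frame
    to its time stamp in the video
    
    INPUTS: 
    filename (string): path to the tabulated iCatcher output file, in format
    '[CHILD_ID]/[TRIAL_ID]_[CHILD_ID].npz'
    stamps (List[int]): time stamp for each frame, where stamps[i] is the 
    time stamp at frame i (determined in the imported function get_frame_information())
    manual_edits_df (DataFrame, optional): manually coded sections for this trial, with
    'onset', 'offset' and 'value' columns; only used when include_manual_edits is set
    
    OUTPUTS: 
    rtype: list of two DataFrames, [looks_onoff_grouped, df], where
    looks_onoff_grouped has one row per look event and df has one row per frame
    
    """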
    npz = np.load(filename)
    df = pd.DataFrame([])
    lst = npz.files
    npz_data = npz[lst[0]]
    confidence_data = npz[lst[1]]

    # Align frame rates of iCatcher and mp4 videos, frame rates coming in are variable because of mp4 conversion
    if len(npz_data) != len(stamps):
        if len(npz_data) > len(stamps):
            npz_data = subsample_data(npz_data, len(stamps))
            confidence_data = subsample_data(confidence_data, len(stamps))
        else:
            stamps = subsample_data(stamps, len(npz_data))
    
    df['frame'] = range(1, len(npz_data) + 1)
    df['lookType_coded'] = npz_data
    # {'noface': -2, 'nobabyface': -1, 'away': 0, 'left': 1, 'right': 2}
    df['lookType_coded'] = df['lookType_coded'].clip(lower=0)
    # 'left' is coded as 1, 'right' as 2, we need to switch those since the video coming in from iCatcher+ is mirrored
    df['lookType_coded'] = df['lookType_coded'].replace({1: 'right', 2: 'left', 0: 'away'})
    
    # convert frames to ms using frame rate
    df['time_ms'] = stamps
    df['time_ms'] = df['time_ms'].astype(int)
    
    df['confidence'] = confidence_data
    
    # SET left/right/away error FRAMES based on manual indexing, from CSV  
    # (guard against the default manual_edits_df=None before calling len())
    if include_manual_edits and manual_edits_df is not None and len(manual_edits_df) > 0:
        for index, row in manual_edits_df.iterrows():
            onset = row['onset']
            offset = row['offset']
            df.loc[onset:offset, 'lookType_coded'] = row['value']
            df.loc[onset:offset, 'confidence'] = -1

    # split into dfs based on when the change happens
    df['group'] = df['lookType_coded'].ne(df['lookType_coded'].shift()).cumsum()
    df_grps = df.groupby('group')
    
    dfs = []
    for _, data in df_grps:
        dfs.append(data)


    looks_onoff_grouped = pd.DataFrame()
    
    for grpI in range(len(dfs)):
        indices = dfs[grpI].index.tolist()
        # first row time_ms is onset, row after last row time_ms is offset 
        onset = df.iloc[indices[0]].time_ms

        if grpI == len(dfs)-1:
            # assume dur of last frame is the mean dur of frames 
            dur = np.floor(np.mean(np.diff(df['time_ms'])))
            offset = onset + dur
        else:
            offset = df.iloc[indices[-1]+1].time_ms
        
        lookType = df.iloc[indices[0]].lookType_coded
        confidence = np.mean(df.iloc[indices].confidence)
    
        cur_lookinfo = pd.DataFrame({"Looks.ordinal": grpI,
                    "Looks.onset" : onset,
                    "Looks.offset": offset, 
                    "Looks.lookType": lookType,
                    "Looks.confidence": confidence},  index=[0])

        looks_onoff_grouped = pd.concat([looks_onoff_grouped, cur_lookinfo], ignore_index=True)


    
    return [looks_onoff_grouped, df]
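
The grouping step in `read_convert_output` collapses consecutive frames with the same look type into one look event. The sketch below shows the same run-length idea on toy data. The function name `collapse_looks` and the column names are hypothetical, and the aggregation uses `groupby(...).agg(...)` rather than the per-group loop above, so it is an equivalent illustration rather than the pipeline's own code.

```python
import numpy as np
import pandas as pd


def collapse_looks(frames):
    """Collapse frame-wise look codes into look events with onset/offset in ms."""
    out = frames.copy()
    # A new group starts whenever the look type differs from the previous frame.
    out["group"] = out["look"].ne(out["look"].shift()).cumsum()
    # A look ends when the next frame begins; for the final frame, assume the
    # mean frame duration (floored), as in the function above.
    mean_dur = np.floor(out["time_ms"].diff().mean())
    out["next_ms"] = out["time_ms"].shift(-1).fillna(out["time_ms"].iloc[-1] + mean_dur)
    looks = out.groupby("group").agg(
        onset=("time_ms", "first"),
        offset=("next_ms", "last"),
        look=("look", "first"),
    )
    return looks.reset_index(drop=True)


frames = pd.DataFrame({
    "time_ms": [0, 33, 66, 100, 133, 166],
    "look": ["left", "left", "away", "away", "away", "right"],
})
looks = collapse_looks(frames)
print(looks)
```

With this toy input, the three look events run from 0 to 66 ms (left), 66 to 166 ms (away), and 166 to 199 ms (right).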

In [128]:
def get_lookit_info(child_id, trial_id, session_id, trial_info_file, lookit_json):
    
    if Path(trial_info_file).is_file(): # check whether lookit file already parsed
        df = pd.read_csv(trial_info_file)
            
    else: # otherwise, parse and save out relevant info  
        os.makedirs(os.path.dirname(trial_info_file), exist_ok=True)
        df = get_lookit_trial_times(lookit_json)
        df.to_csv(trial_info_file)
        
    
    # get part of df from current child
    df = df[df['SubjectInfo.subjID'] == child_id] 

    # get part of df from current trial
    df = df[df['Trials.trialID'] == trial_id]

    if 'SubjectInfo.sessionNumber' in df:
        df = df[df['SubjectInfo.sessionNumber'] == session_id] 
    else:
        # copy first so the assignment does not write into a slice of the full table
        df = df.copy()
        df['SubjectInfo.sessionNumber'] = 1

    df = df.reset_index(drop=True)

    return df
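
The first half of `get_lookit_info` follows a load-or-build caching pattern: read the parsed CSV if it exists, otherwise parse the raw JSON and save the result. A minimal, generic sketch of that pattern is below. `load_or_build` and `fake_parse` are hypothetical names used only for illustration; `fake_parse` stands in for `get_lookit_trial_times`.

```python
import os
import tempfile

import pandas as pd


def load_or_build(csv_path, build_fn):
    """Return the cached table at csv_path, building and saving it on first use."""
    if os.path.isfile(csv_path):
        return pd.read_csv(csv_path)
    os.makedirs(os.path.dirname(csv_path), exist_ok=True)
    df = build_fn()
    df.to_csv(csv_path, index=False)
    return df


calls = []

def fake_parse():
    # Stand-in for the expensive JSON parse; records how often it runs.
    calls.append(1)
    return pd.DataFrame({"SubjectInfo.subjID": ["CKRG26"],
                         "Trials.trialID": ["hard-snail-worm"]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_to_analyze", "trial_info.csv")
    first = load_or_build(path, fake_parse)   # parses and writes the CSV
    second = load_or_build(path, fake_parse)  # reads the cached CSV

print(len(calls))  # 1
```

One thing to keep in mind with this pattern is that a table read back from CSV can have different dtypes than the freshly parsed one (for example, datetimes come back as strings unless parsed explicitly), so downstream code should not depend on which branch ran.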

In [129]:
get_lookit_trial_times(lookit_json=lookit_json)

/Users/visuallearninglab/Documents/visvocab
main
/Volumes/vislearnlab/experiments/visvocab
dict_keys(['response', 'consent', 'study', 'participant', 'child', 'exp_data'])
dict_keys(['response', 'consent', 'study', 'participant', 'child', 'exp_data'])
dict_keys(['response', 'consent', 'study', 'participant', 'child', 'exp_data'])
dict_keys(['response', 'consent', 'study', 'participant', 'child', 'exp_data'])
dict_keys(['response', 'consent', 'study', 'participant', 'child', 'exp_data'])
dict_keys(['response', 'consent', 'study', 'participant', 'child', 'exp_data'])
49-great-job
2-study-intro
3-video-config
4-video-consent
8-exp-get-ready
7-webcam-display
12-easy-snail-cow
[{'SubjectInfo.subjID': '6THVBG', 'Trials.trialID': 'easy-snail-cow', 'absolute_onset': datetime.datetime(2025, 1, 12, 15, 18, 10, 756000), 'Trials.audio_lag_vs_video_lag': datetime.timedelta(microseconds=208000), 'absolute_offset': datetime.datetime(2025, 1, 12, 15, 18, 18, 225000), 'SubjectInfo.testAge': '510', 'Subj

Unnamed: 0,SubjectInfo.subjID,Trials.trialID,absolute_onset,Trials.audio_lag_vs_video_lag,absolute_offset,SubjectInfo.testAge,SubjectInfo.gender,Trials.leftImage,Trials.rightImage,Trials.targetImage,Trials.targetAudio,Trials.trialType,Trials.carrier_onset,Trials.target_onset,Trials.target_offset,Trials.order,SubjectInfo.age_at_birth,SubjectInfo.language_list,SubjectInfo.condition_list,Trials.ordinal
0,CKRG26,hard-bulldozer-tractor-distractor,2025-01-26 01:29:27.506,383.0,2025-01-26 01:29:35.300,630,f,tractor,bulldozer,tractor,see,hard-distractor,2,2.980748,3.960068,10,40 or more weeks,en,,1
0,dKKWAR,hard-turkey-goat-distractor,2025-01-23 19:26:07.634,1.0,2025-01-23 19:26:15.214,450,m,goat,turkey,goat,find,hard-distractor,2,2.980748,3.860907,10,40 or more weeks,en,,1
0,XMJ25D,easy-turkey-swan,2025-01-23 19:44:23.265,188.0,2025-01-23 19:44:30.713,690,m,swan,turkey,turkey,look_at,easy,2,2.980748,3.960068,10,39 weeks,en,,1
0,GNXMNQ,easy-potato-glasses-distractor,2025-01-24 00:43:51.112,35.0,2025-01-24 00:43:58.400,660,m,potato,glasses,glasses,see,easy-distractor,2,2.980748,3.960068,10,37 weeks,en te ur,,1
0,d4HUR3,easy-turtle-horse-distractor,2025-01-24 02:47:22.675,392.0,2025-01-24 02:47:30.308,480,f,horse,turtle,horse,where,easy-distractor,2,2.980748,3.860907,10,39 weeks,en,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,5QCS4M,hard-snail-worm,2025-01-25 20:54:06.501,115.0,2025-01-25 20:54:13.881,750,f,worm,snail,snail,where,hard,2,2.980748,3.860907,48,39 weeks,en ko,,32
0,CKRG26,hard-snail-worm,2025-01-26 01:35:23.116,306.0,2025-01-26 01:35:30.749,630,f,worm,snail,snail,see,hard,2,2.980748,3.860907,48,40 or more weeks,en,,32
0,FHRCdd,easy-potato-glasses-distractor,2025-01-24 17:34:32.680,167.0,2025-01-24 17:34:40.082,540,m,potato,glasses,glasses,find,easy-distractor,2,2.980748,3.960068,48,39 weeks,en ur,,32
0,GNXMNQ,hard-bulldozer-tractor,2025-01-24 00:48:51.853,37.0,2025-01-24 00:48:59.140,660,m,tractor,bulldozer,bulldozer,find,hard,2,2.980748,4.103515,48,37 weeks,en te ur,,32


#### get trial onsets w/r/t video

In [111]:
def get_trial_sets(child_id, trial_id, session_id, trial_info_file):
    """
    Finds the trial info for one child, trial and session, and returns
    (1) a de-duplicated list of one-element lists, each holding the lag between
    audio onset and video recording onset for a trial, in milliseconds, and
    (2) the trial-level DataFrame

    """

    lookit_df = get_lookit_info(child_id, trial_id, session_id, trial_info_file, lookit_json)
    df = lookit_df[[col for col in lookit_df.columns if col.startswith('Trials.')]]    
    df = df.copy()
    df['Trials.ordinal'] = df.index.values.tolist()
    # drop trials with a missing lag so that int() below never sees NaN
    df_sets = df[['Trials.audio_lag_vs_video_lag']].dropna()
    trial_sets = []
    for _, trial in df_sets.iterrows():
        trial_sets.append([int(trial['Trials.audio_lag_vs_video_lag'])])

    def unique(sequence):
        # keep the first occurrence of each entry, preserving order
        seen = set()
        return [x for x in sequence if not (tuple(x) in seen or seen.add(tuple(x)))]

    return unique(trial_sets), df

#### Get subject level data 

In [112]:
def get_subject_info(child_id, trial_id, session_id, trial_info_file): 
        
    lookit_df = get_lookit_info(child_id, trial_id, session_id, trial_info_file, lookit_json)
    # copy so that adding a column below does not write into a slice of lookit_df
    df = lookit_df[[col for col in lookit_df.columns if col.startswith('SubjectInfo.')]].copy()
    if 'SubjectInfo.sessionNumber' in df:
        df = df[df['SubjectInfo.sessionNumber'] == session_id] 
    else:
        df['SubjectInfo.sessionNumber'] = 1
        
    return df

In [113]:
print(icatcher_outputs_dir)

/Volumes/vislearnlab/experiments/visvocab/data/raw/icatcher_annotations


### main function: run processes 

In [134]:
# set up output name, saving a single output file for the experiment
output_data_dir = op.join(PROJECT_PATH, 'data', PROJECT_VERSION, 'data_to_analyze')
fname_final_output = op.join(output_data_dir, 'processed_icatcher.csv')
# create a set of (child, trial) pairs that have already been processed, for fast lookup
if op.exists(fname_final_output):
    existing_df = pd.read_csv(fname_final_output)
    processed_participants = set(zip(existing_df['SubjectInfo.subjID'], existing_df['Trials.trialID']))
else:
    processed_participants = set()

for filename in listdir_nohidden(icatcher_outputs_dir):
    # Only include files that are placed in folders set up in a directory for a particular child
    if '/' in filename and ('easy' in filename or 'hard' in filename):
        # Split into child_id and trial id
        child_id, trial_id_with_extension_child_id = filename.split('/', 1)
        # Remove the .npz extension from trial id
        trial_id_with_child_id = trial_id_with_extension_child_id.rsplit('.', 1)[0]  
        trial_id = trial_id_with_child_id.split("_")[0]
    else:
        continue
    print('child: ', child_id)
    print('trial: ', trial_id)
    # Need to pull session information in the future, our current experiment only uses a single session.
    session_id = 1
    # Check if this subject's data already exists in output file
    if (child_id, trial_id) in processed_participants:
        print(f'Skipping {child_id}, {trial_id} - already exists in output file')
        continue
    # determine trial info files 
    trial_info_file = lookit_trial_info_csv

    # get trial onsets and offsets from input file, match to iCatcher file
    trial_sets, trials_df = get_trial_sets(child_id, trial_id, session_id, trial_info_file)
    trials_df = trials_df.reset_index(drop=True)
    if (trials_df.empty):
        print(f'Skipping {child_id}, {trial_id} - no trial info found')
        continue
    
    if include_manual_edits:
        # combine the two boolean masks element-wise with & (Python's `and` raises on Series)
        manual_edits_df = manually_coded_sections[(manually_coded_sections['child_id'] == child_id) & (manually_coded_sections['trial_id'] == trial_id)]
    
    # determine video source    
    vid_path = op.join(videos_dir, child_id, f"{trial_id}_{child_id}.mp4")
    json_video_data = op.join(videos_dir, child_id, f"{trial_id}_{child_id}.json")
    # get timestamp for each frame in the video
    timestamps, length = get_frame_information(vid_path, json_video_data)
    if not timestamps:
        print('video not found for {} in {} folder'.format(child_id, videos_dir))
        continue

    # initialize df with time stamps for iCatcher file
    icatcher_path = op.join(icatcher_outputs_dir, filename)
    
    if include_manual_edits:
        [looks_df_grouped, looks_df] = read_convert_output(icatcher_path, timestamps, manual_edits_df)
    else:
        [looks_df_grouped, looks_df] = read_convert_output(icatcher_path, timestamps)
    
    looks_df_grouped = looks_df_grouped[['Looks.ordinal', 'Looks.onset', 'Looks.offset', 'Looks.lookType', 'Looks.confidence']]
    looks_df_grouped = looks_df_grouped.reset_index(drop=True)

    # make subject level dataframe
    subject_info = get_subject_info(child_id, trial_id, session_id, trial_info_file)
    subject_info = subject_info.loc[[0], :].copy()
    subject_info['SubjectInfo.subjID'] = subject_info['SubjectInfo.subjID'].astype(str)
    subject_info = subject_info.reset_index(drop=True)
    
    
    df_concat = pd.concat([pd.DataFrame(np.repeat(subject_info.values, len(looks_df_grouped), axis=0), columns=subject_info.columns), 
                           looks_df_grouped, pd.DataFrame(np.repeat(trials_df.values, len(looks_df_grouped), axis=0), columns=trials_df.columns)], axis=1)
    
    # add trial_id and child_id to the dataframe
    looks_df['Trials.trialID'] = trial_id
    looks_df['SubjectInfo.subjID'] = child_id

    if not os.path.exists(output_data_dir):
        os.makedirs(output_data_dir)
    # Append to or create the output file (this writes the frame-level looks_df, not df_concat)
    if os.path.exists(fname_final_output):
        looks_df.to_csv(fname_final_output, mode='a', header=False, index=False)  # Append without header
    else:
        looks_df.to_csv(fname_final_output, index=False)  # Create new file        


CKRG26/easy-turtle-horse-distractor_CKRG26.npz
child:  CKRG26
trial:  easy-turtle-horse-distractor
Skipping CKRG26, easy-turtle-horse-distractor - already exists in output file
CKRG26/easy-turtle-horse_CKRG26.npz
child:  CKRG26
trial:  easy-turtle-horse
Skipping CKRG26, easy-turtle-horse - already exists in output file
CKRG26/hard-acorn-coconut_CKRG26.npz
child:  CKRG26
trial:  hard-acorn-coconut
Skipping CKRG26, hard-acorn-coconut - already exists in output file
CKRG26/easy-acorn-key_CKRG26.npz
child:  CKRG26
trial:  easy-acorn-key
Skipping CKRG26, easy-acorn-key - already exists in output file
CKRG26/easy-snail-cow_CKRG26.npz
child:  CKRG26
trial:  easy-snail-cow
Skipping CKRG26, easy-snail-cow - already exists in output file
CKRG26/hard-potato-pot_CKRG26.npz
child:  CKRG26
trial:  hard-potato-pot
Skipping CKRG26, hard-potato-pot - already exists in output file
CKRG26/hard-cheese-butter_CKRG26.npz
child:  CKRG26
trial:  hard-cheese-butter
Skipping CKRG26, hard-cheese-butter - already

KeyboardInterrupt: 
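
The main loop builds `df_concat` by repeating the one-row subject and trial tables once per look with `np.repeat` on `.values`. As a hedged alternative sketch, a cross join produces the same layout while keeping each column's dtype, since `.values` on a mixed-type table yields an object array. The toy values below are for illustration only, and `how="cross"` assumes pandas 1.2 or later.

```python
import pandas as pd

# One-row subject-level and trial-level tables (toy values).
subject = pd.DataFrame({"SubjectInfo.subjID": ["CKRG26"],
                        "SubjectInfo.testAge": [630]})
trial = pd.DataFrame({"Trials.trialID": ["hard-snail-worm"],
                      "Trials.audio_lag_vs_video_lag": [306.0]})

# Look-level table: one row per look event.
looks = pd.DataFrame({"Looks.ordinal": [0, 1, 2],
                      "Looks.onset": [0, 66, 166],
                      "Looks.offset": [66, 166, 199],
                      "Looks.lookType": ["left", "away", "right"]})

# Cross joins repeat the single subject/trial row for every look,
# giving subject columns, then look columns, then trial columns.
wide = subject.merge(looks, how="cross").merge(trial, how="cross")
print(wide.shape)  # (3, 8)
```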