## Experimental set-up: ##

This code generates experiment files that can either be hosted independently on a website and run with recruited participants, or be used to launch Amazon Mechanical Turk (MTurk) HITs via our [MTurk iPython notebook](https://github.com/a-newman/mturk-api-notebook).

An experiment is composed of different sets of images:
* **target images** are the images you want to collect attention data on - those are images that you provide (in directory `sourcedir` below)
* **tutorial images** are images that will be shown to participants at the beginning of the experiment to get them familiarized with the codecharts set-up (you can reuse the tutorial image we provide, or provide your own in directory `tutorial_source_dir` below)
    * *hint: if your images are very different in content from the images in our set, you may want to train your participants on your own images, to avoid a context switch between the tutorial and main experiment*
* **sentinel images** are validation/calibration images interspersed throughout the experiment. Each one guides participant attention to a very specific point on the screen, which lets you check that participants are actually moving their eyes and looking where they're supposed to. The code below will intersperse images from the `sentinel_target_images` directory we provide throughout your experimental sequence
    * sentinel images can be interspersed throughout both the tutorial and target images, or excluded from the tutorial (via `add_sentinels_to_tutorial` flag below); we recommend having sentinel images as part of the tutorial to familiarize participants with such images as well
    
The code below will populate the `rootdir` task directory with #`num_subject_files` subject files for you, where each subject file corresponds to an experiment you can run on a single participant. For each subject file, a set of #`num_images_per_sf` images will be randomly sampled from the `sourcedir` image directory. A set of #`num_sentinels_per_sf` sentinel images will also be sampled from the `sentinel_imdir` image directory, and distributed throughout the experiment. A tutorial will be generated at the beginning of the experiment with #`num_imgs_per_tutorial` images randomly sampled from the `tutorial_source_dir` image directory, along with an additional #`num_sentinels_per_tutorial` sentinel images distributed throughout the tutorial (if the `add_sentinels_to_tutorial` flag is set to true).
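As a quick sanity check on the parameters described above, the number of trials in one subject file can be worked out by hand. The helper below is a hypothetical illustration (`subject_file_length` is not part of the provided code); it only does the arithmetic implied by the description.

```python
def subject_file_length(num_images_per_sf, num_sentinels_per_sf,
                        num_imgs_per_tutorial, num_sentinels_per_tutorial,
                        add_sentinels_to_tutorial):
    """Return (tutorial_trials, main_trials, total_trials) for one subject file.

    Hypothetical helper: it mirrors the parameter names used in this notebook,
    but it is not a function from the repository.
    """
    # tutorial sentinels only count if the flag is switched on
    tutorial_sentinels = num_sentinels_per_tutorial if add_sentinels_to_tutorial else 0
    tutorial_trials = num_imgs_per_tutorial + tutorial_sentinels
    main_trials = num_images_per_sf + num_sentinels_per_sf
    return tutorial_trials, main_trials, tutorial_trials + main_trials


# with the parameter values used later in this notebook
print(subject_file_length(3, 1, 2, 1, True))  # (3, 4, 7)
```

Each trial is followed by a codechart, so the total is also the number of codes a participant is asked to report.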

In [12]:
import os
import string
import random
import json 
import matplotlib.pyplot as plt
import numpy as np
import base64 
import glob



In [13]:
sourcedir = '../photo_pub/' # replace this with your own directory of experiment images
tutorial_source_dir = 'tutorial_images'  # you can reuse the tutorial images we provide, or provide your own directory

In [14]:

# PARAMETERS for generating subject files

num_subject_files = 3     # number of subject files to generate (i.e., # of mturk assignments that will be put up)    

num_images_per_sf = 3    # number of target images per subject file 

num_imgs_per_tutorial = 2 # number of tutorial images per subject file

num_sentinels_per_sf = 1  # number of sentinel images to distribute throughout the experiment (excluding the tutorial)

add_sentinels_to_tutorial = True # whether to have sentinel images as part of the tutorial

num_sentinels_per_tutorial = 1   # number of sentinel images to distribute throughout the tutorial

Another bit of terminology and experimental logistics involves **buckets** which are a way to distribute experiment stimuli so that multiple experiments can be run in parallel (and participants can be reused for different subsets of images). If you have a lot of images that you want to collect data on, and for each participant you are sampling a set of only #`num_images_per_sf`, then you might have to generate a large `num_subject_files` in order to have enough data points per image. A way to speed up data collection is to split all the target images into #`num_buckets` disjoint buckets, and then to generate subject files per bucket. Given that subject files generated per bucket are guaranteed to have a disjoint set of images, the same participant can be run on multiple subject files from different buckets. MTurk HITs corresponding to different buckets can be launched all at once. In summary, in MTurk terms, you can generate as many HITs as `num_buckets` specified below, and as many assignments per HIT as `num_subject_files`. 
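To make the idea of disjoint buckets concrete, here is a minimal sketch of one way to split a list of image paths. It assumes a shuffle followed by round-robin dealing; the function name `split_into_buckets` and the `seed` argument are hypothetical, and the `distribute_images` function used further down may split images differently.

```python
import random


def split_into_buckets(image_paths, num_buckets, seed=None):
    """Shuffle image_paths and deal them round-robin into num_buckets disjoint lists.

    Illustrative only: not the repository's distribute_images implementation.
    """
    rng = random.Random(seed)
    shuffled = list(image_paths)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    # bucket b receives every num_buckets-th image, starting at offset b
    return [shuffled[b::num_buckets] for b in range(num_buckets)]


images = ['img_%d.jpg' % i for i in range(10)]  # made-up file names
buckets = split_into_buckets(images, num_buckets=3, seed=0)
for b, files in enumerate(buckets):
    print('bucket%d: %d images' % (b, len(files)))
```

Because every image lands in exactly one bucket, a participant who completes subject files from two different buckets never sees the same target image twice.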

In the codecharts methodology, a jittered grid of alphanumeric triplets appears after every image presentation (whether it is a target, sentinel, or tutorial image), because participants indicate where on the preceding image they looked by reporting a triplet. To avoid generating an excessive number of codecharts (which would bulk up all the subject files), we can reuse some codecharts across buckets. We do this by pre-generating #`ncodecharts` codecharts, and then randomly sampling from these when generating the individual subject files.
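The following is a rough sketch of what "a jittered grid of alphanumeric triplets" could look like as data. Everything here is an assumption made for illustration: the triplet format (one uppercase letter plus two digits), the cell size, the jitter amount, and the function name `make_codechart`. Only the `valid_codes` and `coordinates` keys are taken from the way the JSON files are read later in this notebook; the real generator in `generate_codecharts` may differ in every other respect.

```python
import random
import string


def make_codechart(width, height, cell_w=200, cell_h=100, jitter=20, seed=None):
    """Place one unique triplet near the centre of every grid cell, with random jitter."""
    rng = random.Random(seed)
    # pool of unique triplets such as 'A07' (26 letters x 100 numbers = 2600 codes)
    pool = ['%s%02d' % (letter, n) for letter in string.ascii_uppercase for n in range(100)]
    rng.shuffle(pool)

    centres = [(cx, cy)
               for cy in range(cell_h // 2, height, cell_h)
               for cx in range(cell_w // 2, width, cell_w)]
    if len(centres) > len(pool):
        raise ValueError('grid has more cells than available unique triplets')

    valid_codes, coordinates = [], []
    for cx, cy in centres:
        # jitter each position, then clamp so it stays inside the image
        x = min(max(cx + rng.randint(-jitter, jitter), 0), width - 1)
        y = min(max(cy + rng.randint(-jitter, jitter), 0), height - 1)
        valid_codes.append(pool.pop())
        coordinates.append((x, y))
    return {'valid_codes': valid_codes, 'coordinates': coordinates}


chart = make_codechart(1844, 1340, seed=0)
print('%d triplets placed' % len(chart['valid_codes']))
```

Keeping the triplets unique within a chart matters because a reported triplet is later mapped back to a single location.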

In [15]:

# we pre-generate some codecharts and sentinel images so that we can reuse them across participants and buckets
# and potentially not have to generate as many files; these can be set to any number, and the corresponding code
# will just sample as many images as needed per subject file

ncodecharts = num_subject_files*num_images_per_sf # number of codecharts to generate; can be changed
sentinel_images_per_bucket = num_subject_files*num_sentinels_per_sf # can be changed

# set these parameters
num_buckets = 1      # number of disjoint sets of subject files to create (for running multiple parallel HITs)
start_bucket_at = 0  # you can use this and the next parameter to generate more buckets if running the code later
which_buckets = [0]  # a list of specific buckets e.g., [4,5,6] to generate experiment data for

rootdir = '../assets/task_data' # where all the experiment data will be stored
if not os.path.exists(rootdir):
    print('Creating directory %s'%(rootdir))
    os.makedirs(rootdir)

real_image_dir = os.path.join(rootdir,'real_images')              # target images, distributed by buckets
real_CC_dir = os.path.join(rootdir,'real_CC')                     # codecharts corresponding to the target images 
                                                                  # (shared across buckets)
sentinel_image_dir = os.path.join(rootdir,'sentinel_images')      # sentinel images, distributed by buckets
sentinel_CC_dir = os.path.join(rootdir,'sentinel_CC')             # codecharts corresponding to the sentinel images
                                                                  # (shared across buckets)
#sentinel_targetim_dir = os.path.join(rootdir, 'sentinel_target')  

In [16]:

# this cell creates an `all_images` directory, copies images from sourcedir, and pads them to the required dimensions

import create_padded_image_dir

all_image_dir = os.path.join(rootdir,'all_images')
if not os.path.exists(all_image_dir):
    print('Creating directory %s'%(all_image_dir))
    os.makedirs(all_image_dir)
    
allfiles = []
for ext in ('*.jpeg', '*.png', '*.jpg'):
    allfiles.extend(glob.glob(os.path.join(sourcedir, ext)))
print("%d files copied from %s to %s"%(len(allfiles),sourcedir,all_image_dir))
    
image_width,image_height = create_padded_image_dir.save_padded_images(all_image_dir,allfiles)

Creating directory ../assets/task_data\all_images
3 files copied from ../photo_pub/ to ../assets/task_data\all_images
Padding 3 image files to dimensions: [1844,1340]...
Done!


In [17]:
# this cell generates a central fixation cross the size of the required image dimensions
# it is a gray image with a white cross in the middle that is supposed to re-center participant gaze, and provide a
# momentary break, between consecutive images

from generate_central_fixation_cross import save_fixation_cross

save_fixation_cross(rootdir,image_width,image_height)

using font size: 37
Saved fixation cross image as ../assets/task_data\fixation-cross.jpg


In [18]:
# this cell creates the requested number of buckets and distributes images from `all_image_dir` to the corresponding
# bucket directories inside `real_image_dir`

from distribute_image_files_by_buckets import distribute_images

distribute_images(all_image_dir,real_image_dir,num_buckets,start_bucket_at)

Distributing images across 1 buckets
Populating ../assets/task_data\real_images/bucket0 with 3 images


In [19]:
# this cell generates #ncodecharts "codecharts" (jittered grids of triplets) of the required image dimensions

import generate_codecharts 
from create_codecharts_dir import create_codecharts

create_codecharts(real_CC_dir,ncodecharts,image_width,image_height)

Generating 9 codecharts...
0/9
Writing out ../assets/task_data\real_CC\CC_codes.json
Writing out ../assets/task_data\real_CC\CC_codes_full.json
Done!


We create sentinel images by taking a small object (one of: a fixation cross, a red dot, or an image of a face) and choosing a random location for it on a blank image (at least `border_padding` pixels away from the image boundaries). The code below creates #`sentinel_images_per_bucket` such sentinel images in each bucket.
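The placement rule described above can be sketched in a few lines. This is an assumed, simplified version: `choose_sentinel_location` is a hypothetical name, it treats the object as a single point, and the actual logic inside `generate_sentinels` may also account for the size of the object.

```python
import random


def choose_sentinel_location(image_width, image_height, border_padding, seed=None):
    """Pick a random (x, y) at least border_padding pixels from every image edge."""
    if 2 * border_padding >= min(image_width, image_height):
        raise ValueError('border_padding leaves no room to place a sentinel')
    rng = random.Random(seed)
    # random.randint is inclusive on both ends
    x = rng.randint(border_padding, image_width - border_padding - 1)
    y = rng.randint(border_padding, image_height - border_padding - 1)
    return x, y


x, y = choose_sentinel_location(1844, 1340, border_padding=100, seed=1)
print('sentinel placed at (%d, %d)' % (x, y))
```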

In [20]:
# this cell generates #sentinel_images_per_bucket sentinel images per bucket, along with the corresponding codecharts

import generate_sentinels

# settings for generating sentinels
sentinel_type = "img" # one of 'fix_cross', 'red_dot', or 'img'
sentinel_imdir = 'sentinel_target_images' # directory where to find face images to use for generating sentinel images
                                          # only relevant if sentinel_type="img"

border_padding = 100 # keeps the chosen sentinel location away from the image border, where it would be hard to spot

generate_sentinels.generate_sentinels(sentinel_image_dir,sentinel_CC_dir,num_buckets,start_bucket_at,sentinel_images_per_bucket,
                       image_width,image_height,border_padding,sentinel_type,sentinel_imdir)

Populating ../assets/task_data\sentinel_images\bucket0 with 3 sentinel images
Populating ../assets/task_data\sentinel_CC\bucket0 with 3 corresponding codecharts
Writing out ../assets/task_data\sentinel_images\bucket0\sentinel_codes.json
Writing out ../assets/task_data\sentinel_images\bucket0\sentinel_codes_full.json


In [21]:
# this cell generates codecharts corresponding to tutorial images, as well as sentinel images for the tutorial

from generate_tutorials import generate_tutorials

# inherit border_padding and sentinel type from above cell

tutorial_image_dir = os.path.join(rootdir,'tutorial_images') # where processed tutorial images will be saved
if not os.path.exists(tutorial_image_dir):
    print('Creating directory %s'%(tutorial_image_dir))
    os.makedirs(tutorial_image_dir)
    
allfiles = []
for ext in ('*.jpeg', '*.png', '*.jpg'):
    allfiles.extend(glob.glob(os.path.join(tutorial_source_dir, ext)))

create_padded_image_dir.save_padded_images(tutorial_image_dir,allfiles,toplot=False,maxwidth=image_width,maxheight=image_height)

# TODO: or pick a random set of images to serve as tutorial images
N = 2 # number of images to use for tutorials (these will be sampled from to generate subject files below)
      # note: make this larger than num_imgs_per_tutorial so not all subject files have the same tutorials
    
N_sent = 3 # number of sentinels to use for tutorials 
# note: if equal to num_sentinels_per_tutorial, all subject files will have the same tutorial sentinels

generate_tutorials(tutorial_image_dir,rootdir,image_width,image_height,border_padding,N,sentinel_type,sentinel_imdir,N_sent)


Creating directory ../assets/task_data\tutorial_images
Padding 9 image files to dimensions: [1844,1340]...
Done!
A total of 9 images will be sampled from for the tutorials.
Populating ../assets/task_data\tutorial_sentinels with 3 sentinel images
Populating ../assets/task_data\tutorial_CC with 3 corresponding codecharts
Writing out ../assets/task_data\tutorial.json
Writing out ../assets/task_data\tutorial_full.json


Now that all the previous cells have generated the requisite image, codechart, sentinel, and tutorial files, the following code will generate `num_subject_files` individual subject files by sampling from the appropriate image directories and creating an experimental sequence. 
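Part of creating the experimental sequence is deciding where the sentinel images go. The sketch below isolates that placement rule on a list of labels instead of real image entries: sentinels are inserted roughly every `num_images_per_sf / num_sentinels_per_sf` trials, with a random offset of one position either way, never inside the tutorial and never as the last trial. The function name `sentinel_positions` and its `seed` argument are hypothetical and only serve this illustration.

```python
import random


def sentinel_positions(num_tutorial_trials, num_images_per_sf, num_sentinels_per_sf, seed=None):
    """Return a list of trial labels with sentinels spread through the target images."""
    rng = random.Random(seed)
    sequence = ['tutorial'] * num_tutorial_trials + ['real'] * num_images_per_sf
    if num_sentinels_per_sf == 0:
        return sequence  # nothing to insert

    spacing = int(num_images_per_sf / float(num_sentinels_per_sf))
    insertat = num_tutorial_trials + 1  # start just past the tutorial
    for _ in range(num_sentinels_per_sf):
        # advance by the nominal spacing, give or take one position
        insertat += rng.choice(range(spacing - 1, spacing + 2))
        # clamp so the sentinel is inserted before the current last trial
        insertat = min(insertat, len(sequence) - 1)
        sequence.insert(insertat, 'sentinel')
    return sequence


print(sentinel_positions(num_tutorial_trials=3, num_images_per_sf=3, num_sentinels_per_sf=1))
```

With very few target images, as in the small example run in this notebook, the clamp dominates and the sentinel ends up just before the final target image.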

In [22]:
start_subjects_at = 0     # index of the first subject file to create (increase it if you already created subject files previously)
#if os.path.exists(os.path.join(rootdir,'subject_files/bucket0')):
#    subjfiles = glob.glob(os.path.join(rootdir,'subject_files/bucket0/*.json'))
#    start_subjects_at = len(subjfiles)

real_codecharts = glob.glob(os.path.join(real_CC_dir,'*.jpg'))
sentinel_codecharts = glob.glob(os.path.join(sentinel_CC_dir,'*.jpg'))

with open(os.path.join(real_CC_dir,'CC_codes_full.json')) as f:
    real_codes_data = json.load(f) # contains mapping of image path to valid codes

## GENERATING SUBJECT FILES 
subjdir = os.path.join(rootdir,'subject_files')
if not os.path.exists(subjdir):
    os.makedirs(subjdir)
    os.makedirs(os.path.join(rootdir,'full_subject_files'))
    
with open(os.path.join(rootdir,'tutorial_full.json')) as f:
    tutorial_data = json.load(f) 
    
tutorial_real_filenames = [fn for fn in tutorial_data.keys() if tutorial_data[fn]['flag']=='tutorial_real']
tutorial_sentinel_filenames = [fn for fn in tutorial_data.keys() if tutorial_data[fn]['flag']=='tutorial_sentinel']
    
# iterate over all buckets 
for b in range(len(which_buckets)): 
    
    bucket = 'bucket%d'%(which_buckets[b])
    img_bucket_dir = os.path.join(real_image_dir,bucket)
    img_files = []
    for ext in ('*.jpeg', '*.png', '*.jpg'):
        img_files.extend(glob.glob(os.path.join(img_bucket_dir, ext)))
            
    sentinel_bucket_dir = os.path.join(sentinel_image_dir,bucket)
    sentinel_files = glob.glob(os.path.join(sentinel_bucket_dir,'*.jpg'))
    
    with open(os.path.join(sentinel_bucket_dir,'sentinel_codes_full.json')) as f:
        sentinel_codes_data = json.load(f) # contains mapping of image path to valid codes
        
    subjdir = os.path.join(rootdir,'subject_files',bucket)
    if not os.path.exists(subjdir):
        os.makedirs(subjdir)
        os.makedirs(os.path.join(rootdir,'full_subject_files',bucket))
    
    print('Generating %d subject files in bucket %d'%(num_subject_files,which_buckets[b]))
    # for each bucket, generate subject files 
    for i in range(num_subject_files):
        
        random.shuffle(img_files)
        random.shuffle(sentinel_files)
        random.shuffle(real_codecharts)
        
        # for each subject file, add real images 
        sf_data = []
        full_sf_data = []

        # ADDING TUTORIALS
        random.shuffle(tutorial_real_filenames)
        random.shuffle(tutorial_sentinel_filenames)
        
        # initialize temporary arrays, because will shuffle real & sentinel tutorial images before adding to
        # final subject files
        sf_data_temp = []
        full_sf_data_temp = []
        
        for j in range(num_imgs_per_tutorial):
            
            image_data = {}
            fn = tutorial_real_filenames[j]
            image_data["image"] = fn
            image_data["codechart"] = tutorial_data[fn]['codechart_file'] # stores codechart path 
            image_data["codes"] = tutorial_data[fn]['valid_codes'] # stores valid codes 
            image_data["flag"] = 'tutorial_real' # stores flag of whether we have real or sentinel image
            full_image_data = image_data.copy() # identical to image_data but includes a key for coordinates
            full_image_data["coordinates"] = tutorial_data[fn]['coordinates'] # store (x, y) coordinate of each triplet 
            
            sf_data_temp.append(image_data)
            full_sf_data_temp.append(full_image_data)
        
        if add_sentinels_to_tutorial and num_sentinels_per_tutorial>0:
            
            for j in range(num_sentinels_per_tutorial):
                image_data2 = {}
                fn = tutorial_sentinel_filenames[j]
                image_data2["image"] = fn
                image_data2["codechart"] = tutorial_data[fn]['codechart_file'] # stores codechart path 
                image_data2["correct_code"] = tutorial_data[fn]['correct_code']
                image_data2["correct_codes"] = tutorial_data[fn]['correct_codes']
                image_data2["codes"] = tutorial_data[fn]['valid_codes'] # stores valid codes 
                image_data2["flag"] = 'tutorial_sentinel' # stores flag of whether we have real or sentinel image
                full_image_data2 = image_data2.copy() # identical to image_data but includes a key for coordinates
                full_image_data2["coordinate"] = tutorial_data[fn]['coordinate'] # stores coordinate for correct code
                full_image_data2["codes"] = tutorial_data[fn]['valid_codes'] # stores valid codes 
                full_image_data2["coordinates"] = tutorial_data[fn]['coordinates'] # store (x, y) coordinate of each triplet 
                
                sf_data_temp.append(image_data2)
                full_sf_data_temp.append(full_image_data2)
                
        # up to here, have sequentially added real images and then sentinel images to tutorial
        # now want to shuffle them
                
        perm = np.random.permutation(len(sf_data_temp))
        for j in range(len(perm)): # note need to make sure sf_data and full_sf_data correspond
            sf_data.append(sf_data_temp[perm[j]])
            full_sf_data.append(full_sf_data_temp[perm[j]])
        
        # ADDING REAL IMAGES 
        for j in range(num_images_per_sf): 
            image_data = {}
            image_data["image"] = img_files[j] # stores image path 

            # select a code chart
            pathname = real_codecharts[j] # since shuffled, will pick up first set of random codecharts
            
            image_data["codechart"] = pathname # stores codechart path 
            image_data["codes"] = real_codes_data[pathname]['valid_codes'] # stores valid codes 
            image_data["flag"] = 'real' # stores flag of whether we have real or sentinel image
            
            full_image_data = image_data.copy() # identical to image_data but includes a key for coordinates
            full_image_data["coordinates"] = real_codes_data[pathname]['coordinates'] # store locations - (x, y) coordinate of each triplet 

            sf_data.append(image_data)
            full_sf_data.append(full_image_data)

        ## ADDING SENTINEL IMAGES 
        
        sentinel_spacing = int(num_images_per_sf/float(num_sentinels_per_sf))
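        # don't insert before all the tutorial images are done; tutorial sentinels only count if they were added
        num_tutorial_trials = num_imgs_per_tutorial + (num_sentinels_per_tutorial if add_sentinels_to_tutorial else 0)
        insertat = num_tutorial_trials + 1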
        for j in range(num_sentinels_per_sf):
            sentinel_image_data = {}
            sentinel_pathname = sentinel_files[j]
            sentinel_image_data["image"] = sentinel_pathname # stores image path 
            sentinel_image_data["codechart"] = sentinel_codes_data[sentinel_pathname]['codechart_file']
            sentinel_image_data["correct_code"] = sentinel_codes_data[sentinel_pathname]['correct_code']
            sentinel_image_data["correct_codes"] = sentinel_codes_data[sentinel_pathname]['correct_codes']
            sentinel_image_data["codes"] = sentinel_codes_data[sentinel_pathname]["valid_codes"]
            sentinel_image_data["flag"] = 'sentinel' # stores flag of whether we have real or sentinel image
            
            # for analysis, save other attributes too
            full_sentinel_image_data = sentinel_image_data.copy() # identical to sentinel_image_data but includes coordinate key 
            full_sentinel_image_data["coordinate"] = sentinel_codes_data[sentinel_pathname]["coordinate"] # stores the coordinate of the correct code 
            full_sentinel_image_data["codes"] = sentinel_codes_data[sentinel_pathname]["valid_codes"] # stores other valid codes
            full_sentinel_image_data["coordinates"] = sentinel_codes_data[sentinel_pathname]["coordinates"] # stores the (x, y) coordinates of all valid codes 
            
            insertat = insertat + random.choice(range(sentinel_spacing-1,sentinel_spacing+2))
            insertat = min(insertat,len(sf_data)-1)

            sf_data.insert(insertat, sentinel_image_data)
            full_sf_data.insert(insertat, full_sentinel_image_data)

        # Add an image_id to each subject file entry
        image_id = 0 # represents the index of the image in the subject file 
        for d in range(len(sf_data)): 
            sf_data[d]['index'] = image_id
            full_sf_data[d]['index'] = image_id
            image_id+=1

        subj_num = start_subjects_at+i
        with open(os.path.join(rootdir,'subject_files',bucket,'subject_file_%d.json'%(subj_num)), 'w') as outfile: 
            print('Subject file %s DONE'%(outfile.name))
            json.dump(sf_data, outfile)
        with open(os.path.join(rootdir,'full_subject_files',bucket,'subject_file_%d.json'%(subj_num)), 'w') as outfile: 
            json.dump(full_sf_data, outfile)

Generating 3 subject files in bucket 0
Subject file ../assets/task_data\subject_files\bucket0\subject_file_0.json DONE
Subject file ../assets/task_data\subject_files\bucket0\subject_file_1.json DONE
Subject file ../assets/task_data\subject_files\bucket0\subject_file_2.json DONE
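Once the subject files are written, it can be worth checking one before launching an experiment. The sketch below is a hypothetical helper (`summarize_subject_file` is not part of the provided code) that relies only on the `flag` and `index` keys written by the cell above. It is shown on a small in-memory example with made-up file names.

```python
from collections import Counter


def summarize_subject_file(entries):
    """Count trials per flag and check that 'index' runs 0, 1, 2, ... in order."""
    indices = [entry['index'] for entry in entries]
    if indices != list(range(len(entries))):
        raise ValueError('index values are not consecutive starting from 0')
    return dict(Counter(entry['flag'] for entry in entries))


# made-up entries shaped like the ones written above (other keys omitted)
example = [
    {'image': 'tut_a.jpg', 'flag': 'tutorial_real', 'index': 0},
    {'image': 'tut_s.jpg', 'flag': 'tutorial_sentinel', 'index': 1},
    {'image': 'tut_b.jpg', 'flag': 'tutorial_real', 'index': 2},
    {'image': 'img_1.jpg', 'flag': 'real', 'index': 3},
    {'image': 'img_2.jpg', 'flag': 'real', 'index': 4},
    {'image': 'sent_0.jpg', 'flag': 'sentinel', 'index': 5},
    {'image': 'img_3.jpg', 'flag': 'real', 'index': 6},
]
print(summarize_subject_file(example))
```

To check a generated file, load it with `json.load` and pass the resulting list to the helper; the counts should match `num_imgs_per_tutorial`, `num_sentinels_per_tutorial`, `num_images_per_sf`, and `num_sentinels_per_sf`.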
