In [1]:
import pandas as pd
import numpy as np
import json
import os

## Loading the Data

The data is all stored in the folder: "pre_processed_data"

Each *.json* file in that folder corresponds to a single participant's data. One can either load all participants into a single dataframe or access a single participant.

|Language|Vol/MTurker|   Counts  |
|:-------|---------|------------:|
| German  |  MTurk |          50 |
| German  |  Vol.  |           3 |
| English |  MTurk |         183 |
| Spanish |  MTurk |          99 |
| Spanish |  Vol.  |           8 |
| Greek   |  Vol.  |           1 |
| Russian |  Vol.  |           4 |
| Turkish |  MTurk |          15 |
| Turkish |  Vol.  |         210 |
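
To access a single participant without building the full dataframe, one file can be read on its own. The following is a minimal sketch, assuming each participant file is a flat JSON object of field/value pairs (which is what the loading loop in this notebook relies on). `load_participant` is a hypothetical helper name and is not part of the dataset tooling.

```python
import json
import os

import pandas as pd


def load_participant(path):
    """Read one participant's JSON file into a pandas Series (hypothetical helper)."""
    with open(path, "r") as f:
        record = json.load(f)
    # Keys become the index, mirroring one row of the combined dataframe.
    return pd.Series(record)


# Usage (the file name is a placeholder, pick any .json file in the folder):
# participant = load_participant(os.path.join("pre_processed_data", some_file_name))
```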

In [2]:
PRE_PROCESS_DATA_FOLDER = "pre_processed_data"
TEXT_DATA_FOLDER = os.path.join("experiment_data", "set_texts")
FIXATION_DATA_FOLDER = os.path.join("pre_processed_data","fixation_data_per_part")

In [3]:
data_list = []

for file in os.listdir(PRE_PROCESS_DATA_FOLDER):
    if not file.endswith(".json"):
        continue
    with open(os.path.join(PRE_PROCESS_DATA_FOLDER, file), "r") as f:
        load_data = json.load(f)
    data_list.append(pd.Series(list(load_data.values()), index=load_data.keys()))

In [4]:
data = pd.DataFrame(data_list)

### Fields for each participant:

- **worker_id**: the worker ID; if it contains "link", the participant was a volunteer.
- **worker_age**: participant's age, reported in a small questionnaire before the start of the experiment.
- **worker_lang**: mother tongue of the participant.
- **worker_fluency**: self-reported fluency in the language of the experiment, on a scale from 1 to 5.
- **set_name**: the set which the participant completed.
- **set_trials**: the trials completed within the set for this participant.
- **target_error**: flag denoting whether there were errors when calculating the targets for the participant.
- **screen_x**: width of the screen reported by the participant's browser.
- **screen_y**: height of the screen reported by the participant's browser.
- **webgazer_raw_data**: webgazer's gaze points before any pre-processing (x,y,t), stored as a json string for each trial.
- **webgazer_filtered_data**: webgazer's fixation points after pre-processing, with the fixation duration (x,y,t,d), stored as a json string for each trial.
- **avg_roi_last_val**: average accuracy (points within a 100px region of interest (ROI) around the validation point).
- **webgazer_sample_rate**: sampling rate of webgazer in Hz
- **exp_total_time**: total time spent completing the experiment (this is calculated as total time spent in each trial)
- **experiment_reload**: flag indicating if the participant refreshed the page
- **questions_duplicated**: flag indicating if the participant repeated questions. Note: if the participant repeated questions, then the first attempt is the one stored in the respective columns.
- **approved_flag**: flag denoting whether the participant was approved. 0 = "Not Approved ($<50\%$ correct)", 1 = "Approved ($\ge50\%$ correct)", 2 = "Approved ($\ge60\%$ correct)", 3 = "Approved ($\ge75\%$ correct)"
- **total_correct_answers**: number of correct answers for the participant
- **set_language**: language in which the set is in
- **fixation_error**: flag to denote if there is an error with fixations.
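
The **approved_flag** levels listed above can be sketched as a small function. This assumes accuracy is the number of correct answers divided by the number of questions answered; the exact denominator used when the flag was computed is not stated here, so `approval_level` is only an illustration.

```python
def approval_level(n_correct, n_questions):
    """Map an accuracy onto the approved_flag levels (0-3) described above."""
    percent = 100 * n_correct / n_questions
    if percent >= 75:
        return 3
    if percent >= 60:
        return 2
    if percent >= 50:
        return 1
    return 0
```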

The following fields are stored per trial, where i is the position of the trial in the sequence (starting at 0). They follow the order in **set_trials**.
- **trial_{i}_name**: name of the text
- **trial_{i}_condition**: 'nr' for normal reading ($i<5$) and 'is' for information seeking ($i\ge5$).
- **trial_{i}_time**: time spent on the trial
- **trial_{i}_fixation_total**: total number of fixations.
- **trial_{i}_fixation_text_TRT**: total duration of fixations within the image of the paragraph.
- **trial_{i}_fixation_target_TRT**: total duration of fixation in the target (question_{i}_name)
- **trial_{i}_fixation_on_target**: total number of fixations in the target (question_{i}_name)
- **question_{i}_name**: name of the question
- **question_{i}_time**: time spent on the question trial
- **question_{i}_answer**: answer given
- **question_{i}_correct_flag**: if the answer was considered correct
- **question_{i}_target_to_fixation_ratio**: ratio between total fixations and fixations on the target span of the text.

For the 'is' condition ($i\ge5$), there are two more columns:
- **pre_question_{i}_name**: name of the question, before reading the text
- **pre_question_{i}_time**: time spent on the question before reading the text.
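
Because the per-trial fields are spread over many columns, it can be convenient to gather them into one table per participant. The sketch below assumes at most 10 trials per set (5 'nr' followed by 5 'is'), which is an assumption based on the condition ranges above; `trials_to_frame` is a hypothetical helper.

```python
import pandas as pd


def trials_to_frame(participant, n_trials=10):
    """Collect the per-trial fields of one participant row into a long table.

    n_trials=10 is an assumption (5 'nr' trials followed by 5 'is' trials).
    """
    rows = []
    for i in range(n_trials):
        if f"trial_{i}_name" not in participant.index:
            continue  # trial not present for this participant
        rows.append({
            "trial": i,
            "name": participant[f"trial_{i}_name"],
            "condition": participant.get(f"trial_{i}_condition"),
            "time": participant.get(f"trial_{i}_time"),
            "question": participant.get(f"question_{i}_name"),
            "correct": participant.get(f"question_{i}_correct_flag"),
            # Only present for the 'is' condition; missing otherwise.
            "pre_question": participant.get(f"pre_question_{i}_name"),
        })
    return pd.DataFrame(rows)
```

Here `participant` is one row of the dataframe, for example `data.iloc[0]`.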



The data presented in the WebQAmGaze paper were obtained by applying the following filters:

In [5]:
# Filters to apply:
approved_only = (data.approved_flag > 0).to_numpy()
no_fixation_error = (data.fixation_error == False).to_numpy()
no_target_error = (data.target_error == False).to_numpy()
sample_higher_10 = (data.webgazer_sample_rate > 10).to_numpy()  # sampling rate above 10 Hz
acc_higher = (data.avg_roi_last_val > 0).to_numpy()
# Keep MTurk workers only (volunteer IDs contain "link")
filter_mturks = np.array(["link" not in worker_id for worker_id in data["worker_id"]])
# Keep only the English, Spanish and German sets
filter_sets = np.array([set_lang in ["EN", "ES", "DE"] for set_lang in data["set_language"]])

# Nominal minimum is 1280x720; the thresholds are lower to allow some tolerance
screen_x_above_1280 = (data.screen_x > 1110).to_numpy()
screen_y_above_720 = (data.screen_y > 615).to_numpy()
screen_above_1280_720 = screen_x_above_1280 & screen_y_above_720

In [6]:
dict_filter = {
    "filter_mturks" : filter_mturks,
    "filter_sets" : filter_sets,
    "Approved":approved_only,
    "Fix_Error, Target_Error": no_fixation_error & no_target_error,
    "Sample Rate": sample_higher_10,
    "acc_thresh": acc_higher,
    "screen_above_1280_720": screen_above_1280_720,
}

In [7]:
n_total = len(data)
current_filter = np.ones(len(data),dtype=bool)
for condition, f in dict_filter.items():
    n_data_filtered = len(data.iloc[~f & current_filter])
    per_cent = n_data_filtered/n_total * 100
    print(f"For condition ({condition}), {per_cent:.2f}% has been filtered. ({n_data_filtered} out of {n_total})")
    current_filter = current_filter & f
    n_total = len(data.iloc[current_filter])

For condition (filter_mturks), 39.44% has been filtered. (226 out of 573)
For condition (filter_sets), 4.32% has been filtered. (15 out of 347)
For condition (Approved), 31.02% has been filtered. (103 out of 332)
For condition (Fix_Error, Target_Error), 4.37% has been filtered. (10 out of 229)
For condition (Sample Rate), 9.59% has been filtered. (21 out of 219)
For condition (acc_thresh), 1.52% has been filtered. (3 out of 198)
For condition (screen_above_1280_720), 0.51% has been filtered. (1 out of 195)
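
In the output above, each percentage is relative to the participants that remained after the previous filters, not to the full dataset. The same bookkeeping can be sketched as a function that returns a table instead of printing; `sequential_filter_report` is a hypothetical helper, not part of the original analysis code.

```python
import numpy as np
import pandas as pd


def sequential_filter_report(filters, n_rows):
    """Apply boolean masks in order and record how many rows each one removes.

    filters: dict mapping a label to a boolean array of length n_rows.
    Returns the report table and the combined keep-mask.
    """
    keep = np.ones(n_rows, dtype=bool)
    records = []
    for name, mask in filters.items():
        mask = np.asarray(mask, dtype=bool)
        remaining_before = int(keep.sum())
        removed = int((keep & ~mask).sum())  # rows still kept that this filter drops
        keep &= mask
        records.append({
            "filter": name,
            "remaining_before": remaining_before,
            "removed": removed,
            "remaining_after": int(keep.sum()),
        })
    return pd.DataFrame(records), keep
```

With the notebook's variables this would be called as `sequential_filter_report(dict_filter, len(data))`.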


In [8]:
mask = filter_mturks & approved_only & no_fixation_error & no_target_error & sample_higher_10 & screen_above_1280_720 & acc_higher & filter_sets

In [9]:
data_filtered = data[mask].copy()

In [10]:
# Sets collected (each will have different texts)
data_filtered.set_name.unique()

array(['mturk_EN_v01', 'mturk_EN_v20', 'mturk_EN_v11', 'mturk_EN_v16',
       'mturk_ES_v07', 'mturk_EN_v17', 'mturk_EN_v04', 'mturk_DE_v03',
       'mturk_EN_v14', 'mturk_EN_v08', 'mturk_EN_v13', 'mturk_EN_v19',
       'mturk_EN_v09', 'mturk_ES_v05', 'mturk_EN_v12', 'mturk_ES_v03',
       'mturk_ES_v01', 'mturk_EN_v15', 'mturk_ES_v02', 'mturk_EN_v18',
       'mturk_DE_v02', 'mturk_EN_v03', 'mturk_ES_v08', 'mturk_ES_v06',
       'mturk_DE_v04', 'mturk_EN_v07', 'mturk_EN_v06', 'mturk_DE_v01',
       'mturk_ES_v04', 'mturk_EN_v10', 'mturk_ES_v09', 'mturk_EN_v02',
       'mturk_EN_v05'], dtype=object)

In [11]:
# Selecting a single participant
participant_EN_v01_n1 = data_filtered[data_filtered.set_name == "mturk_EN_v01"].iloc[0]

### Information about the set texts

To retrieve information about the texts used in each set, a CSV is provided per set with the trial name, the text itself, the task it was used in, the correct answer for each question, and the language of the text.

| Language|Original Dataset| Unique Texts | Char Count   |   Token Count |   Token AVG Length |   Sentence Count |Average Token per Sentence|                       
|:-------|:----------------|-------------:|-------------:|--------------:|-------------------:|-----------------:|---------------------------:|   
|German  | MECO            |            2 |      1185.00    |        185.00    |               5.52 |             9.00    |                        20.72 |
|German  | XQuAD           |           45 |       508.87 |         82.04 |               5.37 |             3.22 |                        29.10  |
|English | MECO            |            4 |      1191.25 |        203.50  |               4.98 |             8.25 |                        24.70  |
|English | XQuAD           |           97 |       528.28 |         97.15 |               4.60  |             3.43 |                        32.60  |
|Spanish | MECO            |            1 |      1092.00    |        195.00    |               4.70  |             8.00    |                        24.38 |
|Spanish | XQuAD           |           64 |       541.58 |         98.66 |               4.60  |             3.19 |                        34.47 |
|Greek   | MECO            |            1 |      1253.00    |        206.00    |               5.21 |             8.00    |                        25.75 |
|Greek   | XQuAD           |            9 |       595.44 |         99.33 |               5.13 |             3.22 |                        33.75 |
|Russian | MECO            |            1 |      1228.00    |        180.00    |               5.96 |             8.00    |                        22.50  |
|Russian | XQuAD           |            9 |       534.11 |         85.89 |               5.47 |             3.44 |                        27.14 |
|Turkish | MECO            |            2 |      1206.00    |        155.00    |               6.84 |             8.00    |                        19.38 |
|Turkish | XQuAD           |           36 |       545.36 |         77.19 |               6.15 |             3.86 |                        21.69 |

In [12]:
text_data = pd.read_csv(os.path.join(TEXT_DATA_FOLDER, "text_settings_mturk_EN_v01.csv"), index_col=0)
text_data

Unnamed: 0,stimulus,trial_name,task_type,correct_answer,lang
2,"In competitive sports, doping is the use o...",meco_para_3,NR,,EN
3,Can doping have adverse health effects?,meco_para_3_qa_0,NR,True,EN
4,A further type of committee is normally set up...,a_ScottishParliament_2,NR,,EN
5,What is set up to scrutinize private bills sub...,a_ScottishParliament_2_qa_0,NR,committee,EN
6,As northwest Europe slowly began to warm up fr...,a_Rhine_3,NR,,EN
7,When did Europe slowly begin to warm up from t...,a_Rhine_3_qa_0,NR,"22,000 years ago",EN
8,"Fresno is served by State Route 99, the main n...",a_FresnoCalifornia_3,NR,,EN
9,What is another name for the Yosemite Freeway?,a_FresnoCalifornia_3_qa_2,NR,State Route 41,EN
10,Published at a time of rising demand for Germa...,a_MartinLuther_2,NR,,EN
11,At the time of Martin Luther what was in demand?,a_MartinLuther_2_qa_0,NR,German-language publications,EN


This information can also be accessed in a combined table, which contains the information for all sets: "combined_text_features.csv".

Alongside this table there is a statistics table with the number of tokens and sentences per text, "text_token_stats.csv". These were calculated using the following packages:

- https://github.com/apdullahyayik/TrTokenizer: for Turkish
- https://spacy.io/ (spaCy): for all other languages

This table contains:
- **stimulus**: text used in the trial
- **trial_name**: name of the trial
- **task_type**: the task in which the text/question was first presented.
- **correct_answer**: the correct answer, if the stimulus is a question.
- **lang**: language of the text
- **char_count**: calculated as the length of the text string.
- **token_count**: size of the list of tokens returned by the tokenizer.
- **token_avg_length**: average number of characters per token returned by the tokenizer.
- **sentence_count**: number of sentences returned by the sentencizer.
- **original_dataset**: dataset of origin of the stimulus
- **average_token_per_sentence**: average number of tokens per sentence (token_count divided by sentence_count).
- **is_question**: denotes whether the text is a question or not
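
The derived columns follow directly from the tokenizer output. The sketch below takes the tokens and the sentence count as inputs rather than calling spaCy or TrTokenizer, so it only illustrates the arithmetic; `text_stats_from_tokens` is a hypothetical helper, and the token list in the usage example is a hand-written assumption of what the tokenizer returns.

```python
def text_stats_from_tokens(text, tokens, sentence_count):
    """Compute the derived statistics columns from a text and its tokenization."""
    tokens = list(tokens)
    token_count = len(tokens)
    return {
        "char_count": len(text),
        "token_count": token_count,
        "token_avg_length": sum(len(t) for t in tokens) / token_count,
        "sentence_count": sentence_count,
        "average_token_per_sentence": token_count / sentence_count,
    }


# Assumed tokenization of the question shown for a_Teacher_1_qa_3:
stats = text_stats_from_tokens(
    "Was ist die NASUWT?", ["Was", "ist", "die", "NASUWT", "?"], sentence_count=1
)
```

Under that assumed tokenization the result agrees with the corresponding row of "text_token_stats.csv" (19 characters, 5 tokens, average token length 3.2, 5 tokens per sentence).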

In [13]:
text_stats = pd.read_csv("text_token_stats.csv", index_col=0)
text_stats

Unnamed: 0,stimulus,trial_name,task_type,correct_answer,lang,set_name,char_count,token_count,token_avg_length,sentence_count,original_dataset,average_token_per_sentence,is_question
0,Ein Fahrzeugkennzeichen ist ein Metall- oder K...,meco_para_12,NR,,DE,mturk_DE_v01,1199,178,5.842697,8,MECO,22.250000,False
1,"Besteht eine Fahrzeugkennung aus Buchstaben, Z...",meco_para_12_qa_3,NR,False,DE,mturk_DE_v01,66,10,5.900000,1,MECO,10.000000,True
2,Lehrkräfte in Wales können registrierte Mitgli...,a_Teacher_1,NR,,DE,mturk_DE_v01,391,62,5.419355,4,XQuAD,15.500000,False
3,Was ist die NASUWT?,a_Teacher_1_qa_3,NR,Gewerkschaften,DE,mturk_DE_v01,19,5,3.200000,1,XQuAD,5.000000,True
4,Diese Vorhersage war in der abschließenden Zus...,a_IntergovernmentalPanelonClimateChange_2,NR,,DE,mturk_DE_v01,610,100,5.260000,4,XQuAD,25.000000,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
649,Cezayir'de 1620-21 yıllarında kaç kişi kaybedi...,q_after_a_BlackDeath_4_qa_1,IS,30 ila 50 bin,TR,mturk_TR_v04,54,7,7.000000,1,XQuAD,7.000000,True
650,Birçok kromalveolat soyundaki kayıp kloroplast...,a_Chloroplast_3,IS,,TR,mturk_TR_v04,577,74,6.864865,3,XQuAD,24.666667,False
651,Diatomlar ne çeşit bir kloroplasta sahiptir?,q_after_a_Chloroplast_3_qa_2,IS,kırmızı alg kökenli kloroplast,TR,mturk_TR_v04,44,7,5.571429,1,XQuAD,7.000000,True
652,Siyanobakteri bir ataya izi sürülebilen bu klo...,a_Chloroplast_1,IS,,TR,mturk_TR_v04,428,61,6.098361,3,XQuAD,20.333333,False


## Extracting the fixations & targets for a single trial and participant

In [14]:
# Get the data for the first participant from the first English set
participant_EN_v01_n1 = data_filtered[data_filtered.set_name == "mturk_EN_v01"].iloc[0]

In [15]:
# Both 'webgazer_filtered_data' and 'webgazer_raw_data' are JSON strings that can be
# parsed with Python's json library, as shown below.
# 'webgazer_filtered_data': fixations after pre-processing (merging and removal of short fixations).
# These have been translated into a (-100, 1380) x (-100, 820) coordinate space, to fit the 1280x720 images.
# 'webgazer_raw_data': the raw gaze data collected by webgazer, in the original screen coordinates.
# Here we load the already filtered data:
example_of_data_load = json.loads(participant_EN_v01_n1["webgazer_filtered_data"])

In [16]:
# See which trials we have data for:
example_of_data_load.keys()

dict_keys(['meco_para_3', 'meco_para_3_qa_0', 'a_ScottishParliament_2', 'a_ScottishParliament_2_qa_0', 'a_Rhine_3', 'a_Rhine_3_qa_0', 'a_FresnoCalifornia_3', 'a_FresnoCalifornia_3_qa_2', 'a_MartinLuther_2', 'a_MartinLuther_2_qa_0', 'q_before_a_Amazonrainforest_4_qa_1', 'a_Amazonrainforest_4', 'q_after_a_Amazonrainforest_4_qa_1', 'q_before_a_Chloroplast_2_qa_0', 'a_Chloroplast_2', 'q_after_a_Chloroplast_2_qa_0', 'q_before_a_NikolaTesla_1_qa_1', 'a_NikolaTesla_1', 'q_after_a_NikolaTesla_1_qa_1', 'q_before_a_UniversityofChicago_3_qa_2', 'a_UniversityofChicago_3', 'q_after_a_UniversityofChicago_3_qa_2', 'q_before_a_VictoriaAustralia_2_qa_0', 'a_VictoriaAustralia_2', 'q_after_a_VictoriaAustralia_2_qa_0'])

### Loading the Targets for a Text

In [17]:
# To load the word boundaries for the trial above we need 2 things:
# 1. The Set which the experiment is in, which can be accessed in column 'set_name'
# 2. The trial name, which can be seen by the keys in the dictionary.

In [18]:
set_name_trial_example = participant_EN_v01_n1["set_name"]
set_name_trial_example

'mturk_EN_v01'

In [19]:
word_boundaries_path = os.path.join("target_dataframes", f"target_dataframe_{set_name_trial_example}.csv")

The target dataframe contains the text_name, the target_name, the bounding box of the target, and a flag "is_word" denoting whether the target is a word boundary. These targets are calculated from the HTML render of the text in a **1280x720** browser.

In [20]:
word_boundaries_df = pd.read_csv(word_boundaries_path,  index_col=0)
# Make sure the is_word flag is stored as a boolean:
word_boundaries_df.is_word = word_boundaries_df.is_word.astype(bool)
word_boundaries_df.head(10)

Unnamed: 0,text_name,target_name,x,y,width,height,top,right,bottom,left,is_word
0,meco_para_3,meco_para_3_paragraph,34.0,19.5,1215.0,716.799988,22.0,1230.0,730.799988,50.0,False
1,meco_para_3,meco_para_3_qa_0,54.0,505.200012,1131.133301,79.800003,507.700012,1166.133301,579.500015,70.0,False
2,meco_para_3,meco_para_3_qa_1,54.0,45.399994,1169.116699,79.800003,47.899994,1204.116699,119.699997,70.0,False
3,meco_para_3,meco_para_3_qa_2,54.0,379.799988,1113.81665,79.800003,382.299988,1148.81665,454.099991,70.0,False
4,meco_para_3,meco_para_3_qa_3,250.93335,379.799988,348.133331,38.0,382.299988,580.066681,412.299988,266.93335,False
5,meco_para_3,meco_para_3_In_0,56.0,46.333333,48.0,39.666667,-1.0,-1.0,-1.0,-1.0,True
6,meco_para_3,meco_para_3_competitive_0,105.0,46.333333,150.0,39.666667,-1.0,-1.0,-1.0,-1.0,True
7,meco_para_3,"meco_para_3_sports,_0",256.0,46.333333,100.0,39.666667,-1.0,-1.0,-1.0,-1.0,True
8,meco_para_3,meco_para_3_doping_0,357.0,46.333333,102.0,39.666667,-1.0,-1.0,-1.0,-1.0,True
9,meco_para_3,meco_para_3_is_0,460.0,46.333333,45.0,39.666667,-1.0,-1.0,-1.0,-1.0,True


In [21]:
# To loop over all the texts in a set:
# Text names that contain _qa_ correspond to the questions
all_texts_in_set = list(example_of_data_load.keys())
all_texts_in_set

['meco_para_3',
 'meco_para_3_qa_0',
 'a_ScottishParliament_2',
 'a_ScottishParliament_2_qa_0',
 'a_Rhine_3',
 'a_Rhine_3_qa_0',
 'a_FresnoCalifornia_3',
 'a_FresnoCalifornia_3_qa_2',
 'a_MartinLuther_2',
 'a_MartinLuther_2_qa_0',
 'q_before_a_Amazonrainforest_4_qa_1',
 'a_Amazonrainforest_4',
 'q_after_a_Amazonrainforest_4_qa_1',
 'q_before_a_Chloroplast_2_qa_0',
 'a_Chloroplast_2',
 'q_after_a_Chloroplast_2_qa_0',
 'q_before_a_NikolaTesla_1_qa_1',
 'a_NikolaTesla_1',
 'q_after_a_NikolaTesla_1_qa_1',
 'q_before_a_UniversityofChicago_3_qa_2',
 'a_UniversityofChicago_3',
 'q_after_a_UniversityofChicago_3_qa_2',
 'q_before_a_VictoriaAustralia_2_qa_0',
 'a_VictoriaAustralia_2',
 'q_after_a_VictoriaAustralia_2_qa_0']

### Loading the Fixations for a trial

In [22]:
# Get the word boundaries for the text:
text_to_see = all_texts_in_set[0]

In [23]:
# Get the fixation data:
# Note the duration of the first fixation is always set to 0.
np.array(example_of_data_load[text_to_see])

array([[  808,   395,    13,     0],
       [  721,   371,    84,    71],
       [  674,   374,   219,   135],
       ...,
       [  783,   473, 69569,    80],
       [  357,   648, 69954,    80],
       [  313,   638, 70428,   474]])
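
For further analysis it can help to give the four fixation columns names. The sketch below assumes the column order (x, y, t, d) given in the field description of **webgazer_filtered_data**, and flags fixations that fall on the 1280x720 image rather than in the surrounding 100px margin. `fixations_to_frame` and the `on_image` column are hypothetical names introduced here.

```python
import pandas as pd

IMAGE_WIDTH, IMAGE_HEIGHT = 1280, 720  # size of the rendered text image


def fixations_to_frame(fixations):
    """Turn a list of (x, y, t, d) fixations into a DataFrame with named columns."""
    df = pd.DataFrame(fixations, columns=["x", "y", "t", "duration"])
    # Fixations may lie up to 100px outside the image; mark the ones on it.
    df["on_image"] = (
        df["x"].between(0, IMAGE_WIDTH) & df["y"].between(0, IMAGE_HEIGHT)
    )
    return df
```

With the notebook's variables: `fixations_to_frame(example_of_data_load[text_to_see])`.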

In [24]:
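# target_name has the form {text_name}_{word}_{counter}: the text in which the word appears,
# the word itself, and a counter for the nth occurrence of the word in that text.
# x, y are the top-left corner of the bounding box in image coordinates (0-1280, 0-720).
# top, right, bottom and left aren't used for word targets (they are set to -1).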
words_for_text = word_boundaries_df[(word_boundaries_df.text_name == text_to_see) & word_boundaries_df.is_word].copy()

In [25]:
words_for_text

Unnamed: 0,text_name,target_name,x,y,width,height,top,right,bottom,left,is_word
5,meco_para_3,meco_para_3_In_0,56.0,46.333333,48.0,39.666667,-1.0,-1.0,-1.0,-1.0,True
6,meco_para_3,meco_para_3_competitive_0,105.0,46.333333,150.0,39.666667,-1.0,-1.0,-1.0,-1.0,True
7,meco_para_3,"meco_para_3_sports,_0",256.0,46.333333,100.0,39.666667,-1.0,-1.0,-1.0,-1.0,True
8,meco_para_3,meco_para_3_doping_0,357.0,46.333333,102.0,39.666667,-1.0,-1.0,-1.0,-1.0,True
9,meco_para_3,meco_para_3_is_0,460.0,46.333333,45.0,39.666667,-1.0,-1.0,-1.0,-1.0,True
...,...,...,...,...,...,...,...,...,...,...,...
181,meco_para_3,meco_para_3_against_0,1033.0,631.200000,105.0,40.800000,-1.0,-1.0,-1.0,-1.0,True
182,meco_para_3,meco_para_3_the_14,1137.0,631.200000,65.0,40.800000,-1.0,-1.0,-1.0,-1.0,True
183,meco_para_3,meco_para_3_‘spirit_0,55.0,673.166667,87.0,40.666667,-1.0,-1.0,-1.0,-1.0,True
184,meco_para_3,meco_para_3_of_6,143.0,673.166667,52.0,40.666667,-1.0,-1.0,-1.0,-1.0,True


In [26]:
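# Create a column containing the word that the bounding box contains.
words_for_text["word"] = (
    words_for_text.target_name
    .str.replace(f"{text_to_see}_", "", regex=False)  # treat the text name literally, not as a regex
    .str.rsplit("_", n=1).str[0]                       # drop the trailing occurrence counter
)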

In [27]:
words_for_text

Unnamed: 0,text_name,target_name,x,y,width,height,top,right,bottom,left,is_word,word
5,meco_para_3,meco_para_3_In_0,56.0,46.333333,48.0,39.666667,-1.0,-1.0,-1.0,-1.0,True,In
6,meco_para_3,meco_para_3_competitive_0,105.0,46.333333,150.0,39.666667,-1.0,-1.0,-1.0,-1.0,True,competitive
7,meco_para_3,"meco_para_3_sports,_0",256.0,46.333333,100.0,39.666667,-1.0,-1.0,-1.0,-1.0,True,"sports,"
8,meco_para_3,meco_para_3_doping_0,357.0,46.333333,102.0,39.666667,-1.0,-1.0,-1.0,-1.0,True,doping
9,meco_para_3,meco_para_3_is_0,460.0,46.333333,45.0,39.666667,-1.0,-1.0,-1.0,-1.0,True,is
...,...,...,...,...,...,...,...,...,...,...,...,...
181,meco_para_3,meco_para_3_against_0,1033.0,631.200000,105.0,40.800000,-1.0,-1.0,-1.0,-1.0,True,against
182,meco_para_3,meco_para_3_the_14,1137.0,631.200000,65.0,40.800000,-1.0,-1.0,-1.0,-1.0,True,the
183,meco_para_3,meco_para_3_‘spirit_0,55.0,673.166667,87.0,40.666667,-1.0,-1.0,-1.0,-1.0,True,‘spirit
184,meco_para_3,meco_para_3_of_6,143.0,673.166667,52.0,40.666667,-1.0,-1.0,-1.0,-1.0,True,of
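
With the fixations and the word bounding boxes in hand, fixations can be assigned to words with a point-in-box test. The following is a sketch under stated assumptions: a fixation counts for a word only if it lies strictly inside that word's box, with no tolerance margin. This is not necessarily the procedure used to build the provided fixation dictionaries, so the numbers need not match them. `fixations_per_word` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def fixations_per_word(fixations, words):
    """Sum fixation durations (TRT) and count fixations inside each word box.

    fixations: iterable of (x, y, t, d) rows.
    words: DataFrame with columns target_name, x, y, width, height,
           where (x, y) is the top-left corner of the box.
    """
    fix = np.asarray(fixations, dtype=float).reshape(-1, 4)
    fx, fy, fd = fix[:, 0], fix[:, 1], fix[:, 3]
    records = []
    for row in words.itertuples(index=False):
        inside = (
            (fx >= row.x) & (fx < row.x + row.width)
            & (fy >= row.y) & (fy < row.y + row.height)
        )
        records.append({
            "target_name": row.target_name,
            "TRT": float(fd[inside].sum()),   # 0.0 when the word was never fixated
            "FixCount": int(inside.sum()),
        })
    return pd.DataFrame(records, columns=["target_name", "TRT", "FixCount"])
```

With the notebook's variables: `fixations_per_word(example_of_data_load[text_to_see], words_for_text)`.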


### Participant's fixation Dictionary

We also provide a dictionary composed of the targets and the fixations on each word.

These are stored in "pre_processed_data/fixation_data_per_part/{worker_id}_{set_name}_fix_dict.csv", one file per participant.

This is an example of how to load the data for the first participant of the first English set.

In [28]:
# To load the TRTs and Number of Fixations for each word given a trial one can use:
participant_fixation_dictionary = pd.read_csv(os.path.join(FIXATION_DATA_FOLDER,f"{participant_EN_v01_n1.worker_id}_{participant_EN_v01_n1.set_name}_fix_dict.csv")) 
participant_fixation_dictionary


Unnamed: 0,word_id,TRT,FixCount,Span_word_is_in,text_id
0,In_0,,0,meco_para_3_qa_1,meco_para_3
1,competitive_0,480.0,3,meco_para_3_qa_1,meco_para_3
2,"sports,_0",,0,meco_para_3_qa_1,meco_para_3
3,doping_0,113.0,1,meco_para_3_qa_1,meco_para_3
4,is_0,99.0,1,meco_para_3_qa_1,meco_para_3
...,...,...,...,...,...
949,Geelong—will_0,358.0,4,,a_VictoriaAustralia_2
950,close_0,,0,,a_VictoriaAustralia_2
951,in_4,,0,,a_VictoriaAustralia_2
952,October_0,,0,a_VictoriaAustralia_2_qa_3,a_VictoriaAustralia_2


#### Fields in the Fixation Dictionary:
- **word_id**: the word fixated
- **TRT**: the total time in ms of fixations on the word (missing if the word was never fixated)
- **FixCount**: number of fixations on the word
- **Span_word_is_in**: the target span(s) that the word belongs to. This can be checked against the column participant_EN_v01_n1['question_{number_of_trial}_name'] to see which question was asked. A word can belong to multiple targets, in which case the spans are separated by a space.
- **text_id**: text name, which can be found in participant_EN_v01_n1['trial_{number_of_trial}_name']
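
Since **Span_word_is_in** can hold several space-separated span names, selecting the words of one target span needs an exact match on the split names rather than a substring test (otherwise `..._qa_1` would also match `..._qa_10`). A minimal sketch, with `words_in_span` as a hypothetical helper:

```python
import pandas as pd


def words_in_span(fix_dict, span_name):
    """Return the rows of a fixation dictionary whose word lies in span_name.

    Span_word_is_in may hold several space-separated span names, or be missing.
    """
    spans = fix_dict["Span_word_is_in"].fillna("").str.split()
    in_span = spans.apply(lambda names: span_name in names)
    out = fix_dict[in_span].copy()
    # Words that were never fixated have a missing TRT; treat it as 0 here.
    out["TRT"] = out["TRT"].fillna(0.0)
    return out
```

For example, `words_in_span(participant_fixation_dictionary, "meco_para_3_qa_1")` selects the words of that span, after which `TRT` and `FixCount` can be summed over the span.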

In [29]:
# To see the TRTs and FixationCounts for words in the first text:
participant_fixation_dictionary[participant_fixation_dictionary['text_id'] == participant_EN_v01_n1[f'trial_{0}_name']]

Unnamed: 0,word_id,TRT,FixCount,Span_word_is_in,text_id
0,In_0,,0,meco_para_3_qa_1,meco_para_3
1,competitive_0,480.0,3,meco_para_3_qa_1,meco_para_3
2,"sports,_0",,0,meco_para_3_qa_1,meco_para_3
3,doping_0,113.0,1,meco_para_3_qa_1,meco_para_3
4,is_0,99.0,1,meco_para_3_qa_1,meco_para_3
...,...,...,...,...,...
176,against_0,,0,,meco_para_3
177,the_14,,0,,meco_para_3
178,‘spirit_0,,0,,meco_para_3
179,of_6,,0,,meco_para_3
