# Self Regulation Ontology Project 

This file starts with a basic introduction to expanalysis and the Python commands you will probably want for interacting with expfactory data. I have also included some helper functions I have found useful that are not part of expanalysis proper. I'll use basic Python commands along the way without explaining them. Hopefully you can understand them in context (or else google them!).

**Table of Contents** 
- Loading Data
- Basic expanalysis
- Pandas intro 
- Other functions



## Loading Data

Data is stored on an AWS server. To access the data we use expanalysis. I created a helper function that makes loading data a little easier.

In [2]:
from expanalysis.results import Result, get_filters

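def load_data(access_token, data_loc, source = 'file', filters = None):
    '''Load expfactory results from a local file or from the web.
    :param source: 'file' loads saved results from data_loc;
        'web' downloads the results and exports them to data_loc + '.json'
    '''
    if source == 'file':
        results = Result(filters = filters)
        results.load_results(data_loc)
    elif source == 'web':
        #Load Results from Database
        results = Result(access_token, filters = filters)
        results.export(data_loc + '.json')
    else:
        #fail early instead of hitting an undefined `results` below
        raise ValueError("source must be 'file' or 'web', got %r" % source)
    data = results.data
    return data 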



Expanalysis raw data comes with many identifiers that are not important for the kind of analysis we are doing. For instance, we don't care about the cognitive atlas ID. Others may be important at some stage (the experiment_version), but the hope/plan is that for any particular experiment all collected data is using the same experiment version.

To remove columns we don't care about, we define a filter: a dictionary that names some expanalysis columns and says what to do with them. Here we just want to drop them.

After that we need to specify an access token. I am loading it from a directory.

Finally, loading data from the web takes a loooong time. Since we will be working with more-or-less the same data time and time again, it makes sense to save it somewhere (this is what the helper function above does). So I specify a location to use. 

In [3]:
#set up filters
filters = get_filters()
drop_columns = ['battery_description', 'experiment_reference', 'experiment_version', \
         'experiment_name','experiment_cognitive_atlas_task']
for col in drop_columns:
    filters[col] = {'drop': True}

                  
#read the access token; the with block closes the file afterwards
with open('/home/ian/Experiments/expfactory/docs/expfactory_token.txt') as f:
    access_token = f.read().strip()
data_loc = '/home/ian/Experiments/expfactory/Self_Regulation_Ontology/Data/Pilot_Results'     
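
To make the shape of the filter concrete, here is a minimal sketch using plain dictionaries. It does not call expanalysis: `apply_drop_filters` and the sample record are hypothetical, and the only thing taken from the cell above is the `{'drop': True}` convention. How expanalysis itself applies the filter is not shown here.

```python
# Hypothetical illustration: not part of expanalysis.
drop_columns = ['battery_description', 'experiment_version']

# Same convention as the cell above: column name -> {'drop': True}
filters = {col: {'drop': True} for col in drop_columns}

def apply_drop_filters(record, filters):
    """Return a copy of `record` without the keys marked for dropping."""
    return {key: value for key, value in record.items()
            if not filters.get(key, {}).get('drop', False)}

# Made-up record with two columns we want to drop and two we want to keep.
record = {'worker_id': 'A1', 'experiment_version': '0.1',
          'battery_description': 'pilot', 'data': []}
filtered = apply_drop_filters(record, filters)
print(sorted(filtered))
```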


Now with all of that specified we can load the data. If you are loading the data for the first time, or want the most up-to-date data, set `source` to 'web'. Otherwise set it to 'file'.

In [53]:
source_data = load_data(access_token, data_loc, filters = filters, source = 'file')


Empty datasets found! See Results.empty field


So that gives you a dataframe that houses your data. Yay! Let's look at what it looks like:

In [14]:
print 'Dimensions (Rows x Cols):', source_data.shape
print '\nColumns:'
print source_data.columns

Dimensions (Rows x Cols): (3497, 8)

Columns:
Index([         u'finishtime',        u'battery_name',           u'worker_id',
                  u'platform', u'experiment_template',   u'experiment_exp_id',
                      u'data',             u'browser'],
      dtype='object')


Each row is one subject's data for one task. The columns are self-explanatory. The important thing to note is that _all the data_ is in one column - the data column. Let's look at one row's data column.

In [19]:
print source_data.iloc[1]['experiment_exp_id']
print type(source_data.iloc[1]['data'])
source_data.iloc[1]['data']

angling_risk_task_always_sunny
<type 'list'>


[{u'current_trial': 0,
  u'dateTime': 1460134802612,
  u'trialdata': {u'internal_node_id': u'0.0-0.0-0.0',
   u'key_press': 13,
   u'rt': 29250,
   u'time_elapsed': 29251,
   u'trial_id': u'instruction',
   u'trial_index': 0,
   u'trial_type': u'poldrack-text'},
  u'uniqueid': u'583'},
 {u'current_trial': 1,
  u'dateTime': 1460134840600,
  u'trialdata': {u'internal_node_id': u'0.0-0.0-1.0',
   u'rt': 37968,
   u'time_elapsed': 67229,
   u'trial_id': u'instruction',
   u'trial_index': 1,
   u'trial_type': u'poldrack-instructions',
   u'view_history': u'[{"page_index":0,"viewing_time":37968}]'},
  u'uniqueid': u'583'},
 {u'current_trial': 2,
  u'dateTime': 1460134902643,
  u'trialdata': {u'exp_stage': u'practice',
   u'internal_node_id': u'0.0-1.0',
   u'key_press': 13,
   u'release': u'Keep',
   u'rt': 61037,
   u'time_elapsed': 129282,
   u'trial_id': u'intro',
   u'trial_index': 2,
   u'trial_type': u'poldrack-text',
   u'weather': u'Sunny'},
  u'uniqueid': u'583'},
 {u'current_trial'

We see that the data is quite large. After all, this cell holds all of the data for that task! The exact format of this data can differ depending on the _template_ used in expfactory. There are two main templates for us right now - jspsych and expfactory-survey.
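
To get a feel for the jspsych layout printed above, here is a minimal sketch that flattens a list of trial records into one flat dictionary per trial. The records are made-up stand-ins modeled on the output above, and `flatten_trials` is a hypothetical helper; it is not how expanalysis extracts the data.

```python
# Hypothetical records mimicking the jspsych layout printed above.
raw = [
    {'current_trial': 0, 'uniqueid': '583',
     'trialdata': {'trial_id': 'instruction', 'rt': 29250, 'trial_index': 0}},
    {'current_trial': 1, 'uniqueid': '583',
     'trialdata': {'trial_id': 'intro', 'rt': 61037, 'trial_index': 1,
                   'weather': 'Sunny'}},
]

def flatten_trials(records):
    """Pull the nested 'trialdata' dict out of each record, one flat dict per trial."""
    rows = []
    for record in records:
        row = dict(record['trialdata'])       # copy so the input is left untouched
        row['uniqueid'] = record['uniqueid']  # keep the subject identifier
        rows.append(row)
    return rows

rows = flatten_trials(raw)
print(rows[1]['trial_id'])
```

Note that trials do not all share the same keys (only the second one has `weather`), which is one reason a dataframe with missing values is a convenient target format.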

Rather than explain how this data is stored, there are expanalysis commands to help you extract it. But before we get there let's finish loading the data!

Right now we have loaded _all_ of the data from expfactory, regardless of whether we want it or not. To filter it down to only the stuff we want, we have to make use of another function:

In [54]:
from expanalysis.experiments.utils import result_filter

#This filter ensures that all datasets come after this time
first_update_time = '2016-04-17T04:24:37.041870Z'
data = result_filter(source_data, battery = ['Self Regulation Pilot', 'Self Regulation Subset Battery'], \
                     finishtime = first_update_time)

`result_filter` can filter the data based on battery names, particular workers, particular experiments, or time. It can take either a string or an array as the argument for each field and filters appropriately. (Though documentation is a work in progress, if I've been good you can see how to use any function by typing `function?`.)
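
As an illustration of the time filter, the sketch below compares timestamp strings in the same format as `first_update_time`. It is an assumption that `finishtime` values use this format and that the comparison is strictly "later than"; `finished_after` is a hypothetical helper and the finish times are invented.

```python
from datetime import datetime

# Format of timestamps such as '2016-04-17T04:24:37.041870Z'
TIME_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'

def finished_after(finishtime, cutoff):
    """True if the timestamp string `finishtime` is later than `cutoff`."""
    return (datetime.strptime(finishtime, TIME_FORMAT) >
            datetime.strptime(cutoff, TIME_FORMAT))

first_update_time = '2016-04-17T04:24:37.041870Z'
# Hypothetical finish times, for illustration only.
finishtimes = ['2016-04-08T17:00:02.612000Z', '2016-04-20T09:15:00.000000Z']
kept = [t for t in finishtimes if finished_after(t, first_update_time)]
print(kept)
```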

Now that we've filtered it, we are going to use a few more helper functions that add 'nice to have' features to the data.

In [55]:
from expanalysis.experiments.utils import anonymize_data
from expanalysis.experiments.jspsych import calc_time_taken, get_post_task_responses
worker_lookup = anonymize_data(data)
calc_time_taken(data)
get_post_task_responses(data)

These functions (in order):
- replace the worker IDs with anonymized subject IDs and store that relationship in a `worker_lookup` dictionary
- add columns to the dataset for the total time taken for each task (when it is available, i.e. jspsych experiments), time on instructions, and time on task
- add a column with the post-task questionnaire responses
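
The anonymization step can be pictured with the sketch below. It is only an illustration of the idea: `make_worker_lookup` is hypothetical, the worker IDs are invented, and both the labeling scheme and the direction of the mapping in the real `worker_lookup` are assumptions (labels such as `s001` do appear later in this document).

```python
def make_worker_lookup(worker_ids):
    """Map each distinct worker ID to a label like 's001', in order of first appearance."""
    lookup = {}
    for worker_id in worker_ids:
        if worker_id not in lookup:
            lookup[worker_id] = 's%03d' % (len(lookup) + 1)
    return lookup

# Hypothetical worker IDs; the first worker completed two tasks.
worker_ids = ['A1XYZ', 'B2QRS', 'A1XYZ', 'C3TUV']
lookup = make_worker_lookup(worker_ids)
anonymized = [lookup[w] for w in worker_ids]
print(anonymized)
```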

Below I've printed the columns (compare to the columns when we first loaded the data) and used another helper function to print the time taken.

In [28]:
print data.columns, '\n'

def print_time(data, time_col = 'ontask_time'):
    '''Prints time taken for each experiment in minutes
    :param time_col: Dataframe column of time in seconds
    '''
    df = data.copy()    
    assert time_col in df, \
        '"%s" has not been calculated yet. Use calc_time_taken method' % (time_col)
    #drop rows where time can't be calculated
    df = df.dropna(subset = [time_col])
    time = (df.groupby('experiment_exp_id')[time_col].mean()/60.0).round(2)
    print(time)
    return time


time_taken = print_time(data)

Index([         u'finishtime',        u'battery_name',           u'worker_id',
                  u'platform', u'experiment_template',   u'experiment_exp_id',
                      u'data',             u'browser',          u'total_time',
             u'instruct_time',         u'ontask_time', u'post_task_responses'],
      dtype='object') 

experiment_exp_id
adaptive_n_back                                15.85
angling_risk_task_always_sunny                 11.10
attention_network_task                         14.82
bickel_titrator                                 4.84
choice_reaction_time                            3.75
columbia_card_task_cold                         6.07
columbia_card_task_hot                          4.74
dietary_decision                                5.65
digit_span                                      8.69
directed_forgetting                            19.86
discount_titrate                                2.70
dot_pattern_expectancy                         17.74
go_no

Done loading the data!

## Accessing experiments

Above I mentioned that the data is stored in different ways depending on the experiment, and is in a relatively unusable form (in one cell of a dataframe). Expanalysis has functions that let us extract experiments either from rows of the total dataframe (returning one subject's data) or based on an experiment ID (returning a dataframe with all the data across subjects for one experiment).

Below are examples. Many expanalysis functions take a `clean` flag. If the `clean` flag is set to True (the default), the data is transformed in many ways (which differ based on the experiment) before being returned. If you'd like to see the raw data (but still have the data returned as a pretty pandas dataframe), set `clean` to False.

In [36]:
from expanalysis.experiments.processing import extract_row,  extract_experiment

one_subject_df = extract_row(data.iloc[1])
# To extract one experiment, you need to provide the exact experiment ID. If you want to see what experiment IDs
# are in the dataframe, you can type data['experiment_exp_id'].unique(), as below
one_experiment_df = extract_experiment(data, 'stroop')

#viewing the experiments in the dataframe
print data.experiment_exp_id.unique(), '\n'

print 'Data for one subject for the %s' % data.iloc[1].experiment_exp_id, 'Dimensions: ', one_subject_df.shape
print 'Data for all subjects on the stroop task. Dimensions: ', one_experiment_df.shape

[u'adaptive_n_back' u'angling_risk_task_always_sunny'
 u'attention_network_task' u'bickel_titrator' u'bis11_survey'
 u'bis_bas_survey' u'brief_self_control_survey' u'choice_reaction_time'
 u'cognitive_reflection_survey' u'columbia_card_task_cold'
 u'columbia_card_task_hot' u'demographics_survey' u'dickman_survey'
 u'dietary_decision' u'digit_span' u'directed_forgetting'
 u'discount_titrate' u'dospert_eb_survey' u'dospert_rp_survey'
 u'dospert_rt_survey' u'dot_pattern_expectancy' u'eating_survey'
 u'erq_survey' u'five_facet_mindfulness_survey'
 u'future_time_perspective_survey' u'go_nogo' u'grit_scale_survey'
 u'hierarchical_rule' u'holt_laury_survey' u'impulsive_venture_survey'
 u'information_sampling_task' u'keep_track' u'kirby'
 u'leisure_time_activity_survey' u'local_global_letter'
 u'mindful_attention_awareness_survey' u'motor_selective_stop_signal'
 u'mpq_control_survey' u'probabilistic_selection'
 u'psychological_refractory_period_two_choices' u'ravens' u'recent_probes'
 u'select

We can look at the data for these subjects. I'll just use the `one_experiment_df` as an example. If your screen is wide enough, pandas prints very nicely. Otherwise it can be a bit messy looking. As usual, I'll print the columns first. Below I also print the columns for the stroop task _before_ it is "cleaned" so you can get a small sense of what is changing.

In [39]:
print 'Columns:', one_experiment_df.columns , '\n'
print 'Columns before cleaning:', extract_experiment(data,'stroop', clean = False).columns, '\n'
print one_experiment_df

Columns: Index([      u'battery_name',          u'condition',            u'correct',
         u'correct_response',          u'exp_stage',  u'experiment_exp_id',
               u'finishtime',       u'focus_shifts',        u'full_screen',
                u'key_press', u'possible_responses',                 u'rt',
               u'stim_color',          u'stim_word',       u'time_elapsed',
                 u'trial_id',         u'trial_type',          u'worker_id'],
      dtype='object') 

Columns before cleaning: Index([      u'battery_name',     u'block_duration',          u'condition',
                  u'correct',   u'correct_response',         u'credit_var',
                   u'exp_id',          u'exp_stage',  u'experiment_exp_id',
        u'feedback_duration',         u'finishtime',       u'focus_shifts',
              u'full_screen',   u'internal_node_id',          u'key_press',
       u'possible_responses',          u'responses',                 u'rt',
               u'stim_color',

Notice above that the _index_ of the dataframe (the left most column, which isn't really a column, but an index) has a particular structure. It is: 'exp ID' _ 'subject number relative to that experiment' _ 'trial'. Above you can see that there seem to be missing trials. That's because these 'trials' can sometimes just be one jspsych write. So for the stroop task a fixation is displayed (which we don't care about) and then a stim is displayed (where all the information we care about is gathered). The "cleaning" removes the rows we don't care about, making it look neater. Also note that the subject number here is **not** related to the worker_id (or the now anonymized subject ids). It is just an index of the different datasets. If we have 44 workers with the stroop task, the index will go from 0 to 43.
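
If you ever need to pull the pieces back out of such an index label, splitting from the right is the safe approach, because experiment IDs themselves contain underscores. The sketch below assumes the three parts are joined by single underscores; the labels and the helper `split_index_label` are hypothetical.

```python
def split_index_label(label):
    """Split 'expid_subject_trial' from the right so exp IDs may contain underscores."""
    exp_id, subject, trial = label.rsplit('_', 2)
    return exp_id, int(subject), int(trial)

# Hypothetical labels; the exact separator used by expanalysis is assumed here.
simple = split_index_label('stroop_0_12')
with_underscores = split_index_label('angling_risk_task_always_sunny_3_7')
print(simple)
print(with_underscores)
```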

There's too much pandas specific syntax to go through (look that up yourself), but what we have here is just a dataframe. I'll do some basic manipulations below so you can get a sense for what you can do.

In [50]:
# groups the dataframe by the condition column (which has two levels), and gets the median rt for each level
contrast1 = one_experiment_df.groupby('condition')['rt'].median()

# same as above, but now splits up by worker
contrast2 = one_experiment_df.groupby(['worker_id','condition'])['rt'].median()

# you can get multiple variables
contrast3 = one_experiment_df.groupby('condition')[['rt','correct']].agg(['mean','median'])

# you can also just subset to something you care about. Below I used string substitution
worker_df = one_experiment_df.query('worker_id == "%s"' % 's001')
# same thing, different way
worker_df2 = one_experiment_df[one_experiment_df['worker_id'] == 's001']

In [51]:
#replace variable below to look at the different outputs
print contrast1

condition
congruent      634
incongruent    752
Name: rt, dtype: int64
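
If you want to try these manipulations without access to the real data, the following self-contained sketch runs the same kind of grouping and subsetting on a tiny invented dataframe. The values are made up for illustration and are not from the study; only standard pandas calls are used.

```python
import pandas as pd

# Invented stroop-like trials for two anonymized workers.
toy = pd.DataFrame({
    'worker_id': ['s001', 's001', 's002', 's002', 's001', 's002'],
    'condition': ['congruent', 'incongruent', 'congruent',
                  'incongruent', 'congruent', 'incongruent'],
    'rt': [600, 750, 640, 780, 620, 760],
})

# Median rt for each level of condition.
median_rt = toy.groupby('condition')['rt'].median()

# Two equivalent ways of subsetting to one worker.
by_query = toy.query('worker_id == "%s"' % 's001')
by_mask = toy[toy['worker_id'] == 's001']

print(median_rt)
```

The string substitution has to happen on the query string itself, before it is passed to `query`.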


When you want to get a CSV file, you can use basic pandas syntax:

In [63]:
path_to_save_location = '~/'
one_experiment_df.to_csv(path_to_save_location + 'stroop.csv')

## Automated post processing

Every time you run `extract_experiment` it has to clean the data all over again. That cleaning can sometimes be costly. Though I am working on efficiency, occasionally this can take up to 1s for a single subject! I don't want to wait that long over and over! To deal with this there is another function that will post process your data and save it in the data matrix. For example, after we run this command, the data cell for row 2 will have the post processed data rather than the raw data.

_After this step the raw data is gone from this data set_. This is why I have separated out `source_data` from `data` above. `source_data` will maintain the raw data. Of course we could always load the data again, but I am often playing with the data in an active IPython notebook, and this makes it easier.

Post processing is done using the command below. It can take a while, but it only has to run once! It will print out progress along the way. Some experiments have manipulation checks or other things that may print out - totally experiment dependent. 

You could save the data at this stage so that you don't have to go through all of these steps each time. I haven't done that yet, but probably will create a helper function to do so in the future.

In [None]:
from expanalysis.experiments.processing import post_process_data

post_process_data(data)
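
Once post-processing has finished, the result could be cached on disk so the steps above do not have to be repeated. The sketch below is one possible way to do that with the standard `pickle` module; `save_processed` and `load_processed` are hypothetical helpers, not part of expanalysis, and a small dictionary stands in for the processed dataframe.

```python
import os
import pickle
import tempfile

def save_processed(obj, path):
    """Write any picklable object (e.g. a dataframe) to `path`."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_processed(path):
    """Read back an object written by save_processed."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in for the post-processed data; a real dataframe pickles the same way.
processed = {'stroop': [{'rt': 634, 'condition': 'congruent'}]}
path = os.path.join(tempfile.mkdtemp(), 'processed.pkl')
save_processed(processed, path)
restored = load_processed(path)
print(restored == processed)
```

For a dataframe, pandas' own `to_pickle` and `read_pickle` do the same job in one call each. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.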

## Dependent Variables calculation

Dependent variables can be extracted for one experiment, or for many experiments (as seen below). `extract_DVs` both appends columns to the `data` dataframe that hold each subject/task dependent variable and extracts those dependent variables to a new dataframe that is subjects x dependent variables. This is the primary dataframe of interest (we will use it to create covariance matrices).

These functions all rely on experiment-specific functions defined in expanalysis.experiments.jspsych_processing, which are all named `calc_<experiment ID>_DV`. Though I haven't made one yet, there will be a separate file for surveys. The dependent variables currently calculated are somewhat haphazard. I do not expect all of them will actually be useful. For example, I calculated the number of missing responses for many tasks. I do not think that is a DV of scientific interest.
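
The naming convention makes it possible to look up the right function from an experiment ID. The sketch below shows that idea with a small registry; it is an assumption about how such a lookup could work, not the expanalysis implementation. The toy `calc_stroop_DV`, the `stroop_effect` DV, and the trials are invented for illustration.

```python
def calc_stroop_DV(trials):
    """Toy DV: mean incongruent rt minus mean congruent rt."""
    def mean_rt(condition):
        rts = [t['rt'] for t in trials if t['condition'] == condition]
        return sum(rts) / float(len(rts))
    return {'stroop_effect': mean_rt('incongruent') - mean_rt('congruent')}

# Hypothetical registry of DV functions, keyed by function name.
DV_FUNCTIONS = {'calc_stroop_DV': calc_stroop_DV}

def get_dv_function(exp_id):
    """Look up 'calc_<exp_id>_DV'; return None if no such function is registered."""
    return DV_FUNCTIONS.get('calc_%s_DV' % exp_id)

# Hypothetical trials.
trials = [{'condition': 'congruent', 'rt': 600},
          {'condition': 'congruent', 'rt': 640},
          {'condition': 'incongruent', 'rt': 750},
          {'condition': 'incongruent', 'rt': 790}]

dv_func = get_dv_function('stroop')
dvs = dv_func(trials)
print(dvs)
```

Returning `None` for an unknown experiment ID lets the caller skip experiments that have no DV function yet instead of raising an error.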

In [59]:
from expanalysis.experiments.processing import get_DV, extract_DVs
one_experiment_DV = get_DV(data,'stroop')
DV_df = extract_DVs(data)


adaptive_n_back : 2.66923189163
angling_risk_task_always_sunny : 7.21298599243
attention_network_task : 3.16573095322
bickel_titrator : 0.00623393058777
bis11_survey : 0.00827193260193
bis_bas_survey : 0.00872921943665
brief_self_control_survey : 0.00438189506531
choice_reaction_time : 1.9067530632
cognitive_reflection_survey : 0.754488945007
columbia_card_task_cold : 2.58764386177
DV calculated failed for worker: s008
DV calculated failed for worker: s030
columbia_card_task_hot : 2.98071098328
demographics_survey : 0.00792098045349
dickman_survey : 0.00814199447632
dietary_decision : 0.00728702545166
digit_span : 1.59749913216
directed_forgetting : 0.00633692741394
discount_titrate : 0.00797915458679
dospert_eb_survey : 0.0063738822937
dospert_rp_survey : 0.00571608543396
dospert_rt_survey : 0.00422286987305
dot_pattern_expectancy : 2.21447896957
eating_survey : 0.00528407096863
erq_survey : 0.00802302360535
five_facet_mindfulness_survey : 0.00877785682678
future_time_perspective_surv

ValueError: cannot index with vector containing NA / NaN values