# Lab 10: Intro to split-apply-combine operations for tabular data

In [1]:
import pandas as pd

In [2]:
data = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12, 'LEFT'],
        ['312', 'A2', 0.37, 'LEFT'],
        ['312', 'C2', 0.68, 'LEFT'],
        ['313', 'A1', 0.07, 'RIGHT'],
        ['313', 'B1', 0.08, 'RIGHT'],
        ['314', 'A2', 0.29, 'LEFT'],
        ['314', 'B1', 0.14, 'RIGHT'],
        ['314', 'C2', 0.73, 'RIGHT'],
        ['711', 'A1', 4.01, 'RIGHT'],
        ['712', 'A2', 3.29, 'LEFT'],
        ['713', 'B1', 5.74, 'LEFT'],
        ['714', 'B2', 3.32, 'RIGHT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)
data

,subject_id,condition_id,response_time,response
0,312,A1,0.12,LEFT
1,312,A2,0.37,LEFT
2,312,C2,0.68,LEFT
3,313,A1,0.07,RIGHT
4,313,B1,0.08,RIGHT
5,314,A2,0.29,LEFT
6,314,B1,0.14,RIGHT
7,314,C2,0.73,RIGHT
8,711,A1,4.01,RIGHT
9,712,A2,3.29,LEFT
10,713,B1,5.74,LEFT
11,714,B2,3.32,RIGHT


# Group-by

We want to compute the mean response time by condition.

Let's start by doing it by hand, using for loops!

In [None]:
# set up the conditions to loop through
conditions = data['condition_id'].unique()

# set up a dictionary to store the results
results_dict = {}


for condition in conditions:
    group = #  .... pull out just those data for one condition (using "condition")
    results_dict[condition] =  # ... apply the .mean() function to the group you pulled out and save it

results = pd.DataFrame([results_dict], index=['response_time'])

In [None]:
results # print the results

In [None]:
results = results.T # transpose the data (flip it) to show the results more clearly 
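
If you get stuck on the loop, here is one possible way to fill it in, sketched on a small stand-in table. The names `toy`, `means_by_hand`, and `by_hand` are made up for this illustration and are not part of the lab; the boolean mask is just one of several ways to pull out a group.

```python
import pandas as pd

# Stand-in table with the same columns as the lab data, but fewer rows
# (hypothetical, for illustration only).
toy = pd.DataFrame(
    data=[
        ['312', 'A1', 0.12, 'LEFT'],
        ['313', 'A1', 0.07, 'RIGHT'],
        ['312', 'A2', 0.37, 'LEFT'],
        ['314', 'A2', 0.29, 'LEFT'],
    ],
    columns=['subject_id', 'condition_id', 'response_time', 'response'],
)

means_by_hand = {}
for condition in toy['condition_id'].unique():
    # split: a boolean mask keeps only the rows for this condition
    group = toy[toy['condition_id'] == condition]
    # apply: mean response time within that group
    means_by_hand[condition] = group['response_time'].mean()

# combine: collect the per-condition means into one labeled Series
by_hand = pd.Series(means_by_hand, name='response_time')
print(by_hand)
```

The three comments mark the split, apply, and combine steps that give the pattern its name.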

This is a basic operation, and most analyses repeat this pattern many times!

Pandas, like most other tools for tabular data, provides a command for performing operations on groups.
Let's do that instead...

In [None]:
# df.groupby(column_name) groups a DataFrame by the values in the column
# Group by condition using the .groupby() function


In [None]:
# The group-by object can be used much like a DataFrame, so we can chain directly onto it.
# Operations are executed on each group individually, then aggregated

# start with the .size() method

In [None]:
# try applying the .mean() function to the response time grouped by condition


In [None]:
# try applying the .max() function to the response time grouped by condition
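
For comparison, here is a rough sketch of the three group-by exercises on a small stand-in table. The table `toy` and the variable names are invented for this example; swap in the lab's `data` to check your own answers.

```python
import pandas as pd

# Stand-in table (hypothetical, for illustration only).
toy = pd.DataFrame(
    {
        'condition_id': ['A1', 'A1', 'A2', 'A2', 'B1'],
        'response_time': [0.12, 0.07, 0.37, 0.29, 0.08],
    }
)

# Grouping alone does not compute anything yet; it only records the split.
grouped = toy.groupby('condition_id')

group_sizes = grouped.size()                    # number of rows per condition
group_means = grouped['response_time'].mean()   # mean response time per condition
group_maxes = grouped['response_time'].max()    # slowest response per condition

# .agg computes several summaries in one call, one column per function
summary = grouped['response_time'].agg(['count', 'mean', 'max'])
print(summary)
```

Selecting `['response_time']` before aggregating keeps the result to the one numeric column we care about.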


# Pivot tables - review again and compare!

We want to look at response time biases when the subjects respond LEFT vs RIGHT. In principle, we expect them to have the same response time in both cases.

We compute a summary table with 1) condition_id on the rows; 2) response on the columns; 3) the average response time for all experiments with that condition and response.

We can do it with `groupby`, with some table manipulation commands.

In [5]:
summary = data.groupby(['condition_id', 'response'])['response_time'].mean()
summary

condition_id  response
A1            LEFT        0.120000
              RIGHT       2.040000
A2            LEFT        1.316667
B1            LEFT        5.740000
              RIGHT       0.110000
B2            RIGHT       3.320000
C2            LEFT        0.680000
              RIGHT       0.730000
Name: response_time, dtype: float64

In [17]:
summary.unstack(level=1) # play around with level here - what if level = 0 - can you figure out what the method is doing? 

condition_id,LEFT,RIGHT
A1,0.12,2.04
A2,1.316667,
B1,5.74,0.11
B2,,3.32
C2,0.68,0.73
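
To see what `level` does, it can help to unstack the same grouped result both ways. This is a small sketch on a stand-in table (`toy` and the `wide_by_*` names are invented here): `level=1` moves the second index level (`response`) into the columns, while `level=0` moves the first (`condition_id`), which gives the transposed table.

```python
import pandas as pd

# Stand-in table (hypothetical, for illustration only).
toy = pd.DataFrame(
    {
        'condition_id': ['A1', 'A1', 'B1', 'B1', 'B2'],
        'response': ['LEFT', 'RIGHT', 'LEFT', 'RIGHT', 'RIGHT'],
        'response_time': [0.12, 2.04, 5.74, 0.11, 3.32],
    }
)

# Grouping by two columns gives a Series with a two-level index:
# level 0 is condition_id, level 1 is response.
summary = toy.groupby(['condition_id', 'response'])['response_time'].mean()

wide_by_response = summary.unstack(level=1)   # response values become columns
wide_by_condition = summary.unstack(level=0)  # condition_id values become columns

print(wide_by_response)
print(wide_by_condition)
```

Combinations that never occur in the data (here B2 with LEFT) show up as NaN in the wide table.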


But `pivot_table` can also be used to perform this kind of operation straightforwardly.

In [None]:
data.pivot_table(index='condition_id', columns='response', values='response_time', aggfunc='mean')

In [None]:
(
    data
    .pivot_table(
        index='condition_id', 
        columns='response', 
        values='response_time', 
        aggfunc=['mean', 'std', 'count'],
    )
)
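
As a quick check that the two approaches agree, the sketch below builds the same table with `pivot_table` and with `groupby` plus `unstack`, then looks at how a list of aggregation functions changes the columns. The table `toy` and the variable names are made up for this example.

```python
import pandas as pd

# Stand-in table (hypothetical, for illustration only).
toy = pd.DataFrame(
    {
        'condition_id': ['A1', 'A1', 'A1', 'B1', 'B1', 'B1'],
        'response': ['LEFT', 'RIGHT', 'RIGHT', 'RIGHT', 'RIGHT', 'LEFT'],
        'response_time': [0.12, 0.07, 4.01, 0.08, 0.14, 5.74],
    }
)

# One call: rows, columns, values, and the aggregation are all named explicitly.
pivoted = toy.pivot_table(
    index='condition_id',
    columns='response',
    values='response_time',
    aggfunc='mean',
)

# The same table built in two steps.
unstacked = (
    toy
    .groupby(['condition_id', 'response'])['response_time']
    .mean()
    .unstack(level=1)
)

# With a list of functions, the columns get an extra outer level,
# one block of LEFT/RIGHT columns per function.
multi = toy.pivot_table(
    index='condition_id',
    columns='response',
    values='response_time',
    aggfunc=['mean', 'count'],
)
print(multi.columns.tolist())
print(multi['mean'])  # selecting the outer level recovers the plain mean table
```

One thing to watch for when you add `'std'` on the lab data: a cell with a single observation has no sample standard deviation, so pandas reports NaN there.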