# 3.3: Split-apply-combine in pandas

Now that we are (somewhat) comfortable with transforming data between wide and long formats, we can get into another very powerful pandas feature known as split-apply-combine.

---

## Dataset

The three files are in your ```../assets/datasets/``` directory. They are:

- ```mach_data.csv``` which contains the wide data.
- ```mach_long.csv``` which contains the already long data.
- ```mach_codebook.csv``` which contains the information about the survey data.

---

## Packages

Loaded same as ever.

In [2]:
# data modules
import numpy as np
import scipy.stats as stats
import pandas as pd

# plotting modules
import matplotlib.pyplot as plt
import seaborn as sns

# make sure charts appear in the notebook:
%matplotlib inline



---

## A: Load the already long data

I have already melted the data into long format for you here in the interest of time. You can see how I get started on the bonus 2 question below as well.

If you would like to do the melting part yourself for practice, be my guest! The more practice the better. In that case, load the ```mach_data.csv``` file instead.

In [22]:
mach_long = pd.read_csv('../datasets/mach_long.csv')
mach_long

Unnamed: 0,age,gender,subject_id,variable,value
0,24,1,1,Q1,4
1,33,2,2,Q1,2
2,21,1,3,Q1,3
3,17,1,4,Q1,4
4,22,1,5,Q1,4
5,21,1,6,Q1,5
6,55,1,7,Q1,4
7,40,1,8,Q1,1
8,30,2,9,Q1,3
9,65,1,10,Q1,2


In [23]:
print(mach_long.variable.unique())

['Q1' 'Q2' 'Q3' 'Q4' 'Q5' 'Q6' 'Q7' 'Q8' 'Q9' 'Q10' 'Q11' 'Q12' 'Q13' 'Q14'
 'Q15' 'Q16' 'Q17' 'Q18' 'Q19' 'Q20' 'score' 'seconds_elapsed']


In [24]:
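# drop the precomputed score rows
mach_long = mach_long[mach_long.variable != 'score']

# pull out the total time per subject; .copy() so we can safely add a column
seconds_data = mach_long.loc[mach_long.variable == 'seconds_elapsed', ['subject_id','variable','value']].copy()

# 20 questions per subject, so divide the total time by 20
seconds_data['seconds_per_q'] = seconds_data.value / 20.

# keep only the question rows in the long data
mach_long = mach_long[mach_long.variable != 'seconds_elapsed']

seconds_data.head()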

Unnamed: 0,subject_id,variable,value,seconds_per_q
255906,1,seconds_elapsed,177,8.85
255907,2,seconds_elapsed,107,5.35
255908,3,seconds_elapsed,323,16.15
255909,4,seconds_elapsed,136,6.8
255910,5,seconds_elapsed,142,7.1


### A.2: Sorting by multiple columns with custom sort for questions

We are going to use ```pd.Categorical``` to custom-sort the questions within each subject. Sorted as plain strings, ```Q10``` would come before ```Q2```.

See: http://stackoverflow.com/questions/13838405/custom-sorting-in-pandas-dataframe

Look stuff up online!!

In [25]:
categorical_questions = pd.Categorical(mach_long.variable,
                                       ['Q1','Q2','Q3','Q4',
                                        'Q5','Q6','Q7','Q8',
                                        'Q9','Q10','Q11','Q12',
                                        'Q13','Q14','Q15','Q16',
                                        'Q17','Q18','Q19','Q20'])

mach_long['variable'] = categorical_questions

mach_long.sort_values(['subject_id','variable'], inplace=True)

mach_long.head()

Unnamed: 0,age,gender,subject_id,variable,value
0,24,1,1,Q1,4
12186,24,1,1,Q2,4
24372,24,1,1,Q3,2
36558,24,1,1,Q4,2
48744,24,1,1,Q5,4
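
To see why the categorical matters, here is a minimal sketch on a made-up three-row frame (the ```toy``` data is hypothetical, not taken from the Mach dataset). It compares the default alphabetical sort with the sort that follows the category order.

```python
import pandas as pd

# Hypothetical toy data: three questions answered by one subject.
toy = pd.DataFrame({
    'subject_id': [1, 1, 1],
    'variable': ['Q10', 'Q2', 'Q1'],
    'value': [3, 5, 4],
})

# Plain strings sort alphabetically, so 'Q10' lands before 'Q2'.
alphabetical = toy.sort_values(['subject_id', 'variable'])['variable'].tolist()

# A categorical column sorts by the order of its categories instead.
toy['variable'] = pd.Categorical(toy['variable'], categories=['Q1', 'Q2', 'Q10'])
custom = toy.sort_values(['subject_id', 'variable'])['variable'].tolist()

print(alphabetical)  # ['Q1', 'Q10', 'Q2']
print(custom)        # ['Q1', 'Q2', 'Q10']
```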


### A.3: Adding the time by question by subject_id as a new value

Merge the seconds data and the long data together. This is a preview of future lessons!

In [26]:
mach_secs = mach_long.merge(seconds_data[['subject_id', 'seconds_per_q']], on='subject_id', sort=False)

mach_secs

Unnamed: 0,age,gender,subject_id,variable,value,seconds_per_q
0,24,1,1,Q1,4,8.85
1,24,1,1,Q2,4,8.85
2,24,1,1,Q3,2,8.85
3,24,1,1,Q4,2,8.85
4,24,1,1,Q5,4,8.85
5,24,1,1,Q6,2,8.85
6,24,1,1,Q7,3,8.85
7,24,1,1,Q8,5,8.85
8,24,1,1,Q9,3,8.85
9,24,1,1,Q10,4,8.85
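
The merge repeats each subject's single ```seconds_per_q``` value on every one of that subject's question rows. A small sketch of the same idea, using hypothetical stand-in frames (```answers``` and ```timing``` are made-up names and values for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for mach_long and seconds_data.
answers = pd.DataFrame({
    'subject_id': [1, 1, 2, 2],
    'variable': ['Q1', 'Q2', 'Q1', 'Q2'],
    'value': [4, 4, 2, 2],
})
timing = pd.DataFrame({
    'subject_id': [1, 2],
    'seconds_per_q': [8.85, 5.35],
})

# One timing row per subject is matched to every answer row with the same subject_id.
merged = answers.merge(timing, on='subject_id', sort=False)
print(merged)
```

The result keeps one row per answer, so the per-subject value simply appears several times.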


### A.4: Doing a cumulative sum of the seconds

Do a cumulative sum of the seconds by subject.

This is an example of the split-apply-combine pattern. We will do more as well.

#### A.4.1: Split the data into groups by subject_id

In [12]:
mach_split = mach_secs.groupby(['subject_id'])
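
The group object does not compute anything yet; it only records which rows belong to which subject. One way to peek inside is to iterate over it, sketched here on a hypothetical toy frame (the names ```toy``` and ```toy_split``` and the values are made up):

```python
import pandas as pd

# Hypothetical toy data: two subjects with a different number of rows each.
toy = pd.DataFrame({
    'subject_id': [1, 1, 2, 2, 2],
    'seconds_per_q': [8.85, 8.85, 5.35, 5.35, 5.35],
})

toy_split = toy.groupby('subject_id')

# Iterating yields (group key, sub-DataFrame) pairs, one pair per subject.
group_sizes = {}
for subject, group in toy_split:
    group_sizes[int(subject)] = len(group)

print(group_sizes)
```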

#### A.4.2: Apply the cumulative sum function, iterating through the groups

The ```.transform()``` method on group objects takes a function as its argument and returns a result with the same number of rows as the input.

In [19]:
mach_split.transform(np.cumsum)

Unnamed: 0,age,gender,variable,value,seconds_per_q
0,24,1,Q1,4,8.85
1,48,2,Q1Q2,8,17.7
2,72,3,Q1Q2Q3,10,26.55
3,96,4,Q1Q2Q3Q4,12,35.4
4,120,5,Q1Q2Q3Q4Q5,16,44.25
5,144,6,Q1Q2Q3Q4Q5Q6,18,53.1
6,168,7,Q1Q2Q3Q4Q5Q6Q7,21,61.95
7,192,8,Q1Q2Q3Q4Q5Q6Q7Q8,26,70.8
8,216,9,Q1Q2Q3Q4Q5Q6Q7Q8Q9,29,79.65
9,240,10,Q1Q2Q3Q4Q5Q6Q7Q8Q9Q10,33,88.5
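
Note that the output above accumulates every column, including ```age``` and ```gender```, where a running total is not meaningful. A possible alternative, sketched on hypothetical toy data, is to select the one column of interest before transforming. The sketch also contrasts ```transform``` (one value per row) with ```agg``` (one value per group); the column name ```cumulative_seconds``` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data shaped like mach_secs.
toy = pd.DataFrame({
    'subject_id': [1, 1, 1, 2, 2],
    'seconds_per_q': [8.85, 8.85, 8.85, 5.35, 5.35],
})

# Split by subject, then select only the column we want to accumulate.
grouped = toy.groupby('subject_id')['seconds_per_q']

# transform returns one value per original row, so it can be stored as a new column.
toy['cumulative_seconds'] = grouped.transform('cumsum')

# agg collapses each group to a single value instead.
total_seconds = grouped.agg('sum')

print(toy)
print(total_seconds)
```

The running total restarts at each new ```subject_id```, which is the "combine" step putting the per-group results back in the original row order.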


In [20]:
mach_secs_wide = pd.pivot_table(mach_secs,
                               values=['value'],
                               index=['subject_id','age','gender'],
                               columns=['variable'])

In [21]:
mach_secs_wide

subject_id,age,gender,Q1,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q2,Q20,Q3,Q4,Q5,Q6,Q7,Q8,Q9
1,24,1,4,4,2,5,2,1,4,4,2,0,5,4,2,2,2,4,2,3,5,3
2,33,2,2,4,2,4,2,2,3,2,2,4,5,2,2,4,4,4,2,2,2,4
3,21,1,3,3,2,4,4,3,3,4,2,4,5,4,1,2,4,4,2,2,2,4
4,17,1,4,3,1,5,3,2,4,2,2,3,4,4,2,3,2,4,3,4,5,4
5,22,1,4,3,1,2,4,4,5,4,1,4,5,5,4,2,3,5,0,3,4,3
6,21,1,5,2,2,4,4,2,4,2,2,4,4,5,5,5,2,4,3,2,4,4
7,55,1,4,2,1,4,4,2,4,1,2,5,5,4,3,3,1,5,3,2,4,2
8,40,1,1,4,2,3,1,4,3,4,3,3,4,3,1,5,2,4,4,3,1,4
9,30,2,3,4,1,5,4,2,1,2,2,4,5,3,2,5,2,4,2,4,4,3
10,65,1,2,5,2,4,3,4,2,2,2,3,4,1,4,2,4,3,4,4,4,5


In [22]:
s = pd.Series(["a","b","c","a"], dtype="category")

In [23]:
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

In [24]:
df = pd.DataFrame({"A":["a","b","c","a"]})

df["B"] = df["A"].astype('category')

In [26]:
df.dtypes

A      object
B    category
dtype: object

In [31]:
df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})

labels = [ "{0} - {1}".format(i, i + 9) for i in range(0, 100, 10) ]

df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)

df.head(10)

Unnamed: 0,value,group
0,69,60 - 69
1,53,50 - 59
2,86,80 - 89
3,92,90 - 99
4,15,10 - 19
5,98,90 - 99
6,17,10 - 19
7,48,40 - 49
8,69,60 - 69
9,85,80 - 89


In [32]:
raw_cat = pd.Categorical(["a","b","c","a"], categories=["b","c","d"], ordered=False)
s = pd.Series(raw_cat)
s

0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b, c, d]