# 3.3: Split-apply-combine in pandas

Now that we are (somewhat) comfortable reshaping data between wide and long formats, we can get into another very powerful pandas pattern known as split-apply-combine.
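As a quick preview of the idea, here is a minimal sketch on a made-up table (the `toy` DataFrame and its column names are invented for illustration, not part of the lesson data): split the rows by a key, apply a function to each piece, and combine the results.

```python
import pandas as pd

# Toy data, invented for illustration (not the course dataset).
toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b', 'b'],
    'value': [1, 2, 3, 4, 5],
})

# Split the rows by 'group', apply sum() to each piece,
# and combine the per-group results into a single Series.
totals = toy.groupby('group')['value'].sum()
print(totals)
```

The result has one entry per group, indexed by the group key.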

---

## Dataset

The three files are in your ```../assets/datasets/``` directory. They are:

- ```mach_data.csv``` which contains the wide data.
- ```mach_long.csv``` which contains the already long data.
- ```mach_codebook.csv``` which contains the information about the survey data.

---

## Packages

Load the usual packages, same as ever.

In [1]:
!pwd

/Users/smoot/Desktop/ga/DSI_SM_01/curriculum/week-02/3.3-lesson/code


In [16]:
# data modules
import numpy as np
import scipy.stats as stats
import pandas as pd

# plotting modules
import matplotlib.pyplot as plt
import seaborn as sns

# make sure charts appear in the notebook:
%matplotlib inline

---

## A: Load the already long data

I have already melted the data into long format for you here in the interest of time. You can see how I get started on the bonus 2 question below as well.

If you would like to do the melting part yourself for practice, be my guest! The more practice the better. If you do it yourself, load the ```mach_data.csv``` file again instead.
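If you want a reminder of what the melting step looks like, here is a small sketch. The `toy_wide` table and its columns are made up for illustration; the real `mach_data.csv` may have different column names, so treat this as a pattern rather than the exact call for that file.

```python
import pandas as pd

# Hypothetical wide table: one row per subject, one column per question.
toy_wide = pd.DataFrame({
    'subject_id': [1, 2],
    'Q1': [4, 2],
    'Q2': [3, 5],
})

# Melt to long format: one row per (subject, question) pair.
# 'variable' and 'value' are pd.melt's default output column names,
# spelled out here to make the result explicit.
toy_long = pd.melt(toy_wide, id_vars=['subject_id'],
                   var_name='variable', value_name='value')
print(toy_long)
```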

In [17]:
mach_long = pd.read_csv('../mach_long.csv')
mach_long.head()

Unnamed: 0,age,gender,subject_id,variable,value
0,24,1,1,Q1,4
1,33,2,2,Q1,2
2,21,1,3,Q1,3
3,17,1,4,Q1,4
4,22,1,5,Q1,4


In [19]:
print(mach_long.variable.unique())

['Q1' 'Q2' 'Q3' 'Q4' 'Q5' 'Q6' 'Q7' 'Q8' 'Q9' 'Q10' 'Q11' 'Q12' 'Q13' 'Q14'
 'Q15' 'Q16' 'Q17' 'Q18' 'Q19' 'Q20' 'score' 'seconds_elapsed']


In [22]:
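# drop the total score rows; we only need the answers and the timing
mach_long = mach_long[mach_long.variable != 'score']

# pull out the total seconds per subject (.loc replaces the deprecated .ix indexer)
seconds_data = mach_long.loc[mach_long.variable == 'seconds_elapsed', ['subject_id','variable','value']].copy()

# average seconds per question (there are 20 questions)
seconds_data['seconds_per_q'] = seconds_data.value / 20.

# keep only the question rows in the long data
mach_long = mach_long[mach_long.variable != 'seconds_elapsed'].copy()

seconds_data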

Unnamed: 0,subject_id,variable,value,seconds_per_q
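The row-and-column selection used in the cell above can be sketched on a tiny made-up table. The `toy` data here is invented, and the sketch assumes the same idea as the lesson: a boolean mask picks the rows, a list picks the columns, and `.copy()` gives an independent DataFrame that is safe to add columns to.

```python
import pandas as pd

# Made-up long data with two subjects (not the course dataset).
toy = pd.DataFrame({
    'subject_id': [1, 1, 2, 2],
    'variable': ['Q1', 'seconds_elapsed', 'Q1', 'seconds_elapsed'],
    'value': [4, 177, 2, 120],
})

# Boolean mask: True for the timing rows.
is_seconds = toy['variable'] == 'seconds_elapsed'

# .loc[rows, columns] selects the masked rows and the listed columns.
seconds = toy.loc[is_seconds, ['subject_id', 'value']].copy()
seconds['seconds_per_q'] = seconds['value'] / 20.

# ~ inverts the mask, leaving only the question rows.
questions_only = toy[~is_seconds]
```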


### A.2: Sorting by multiple columns with a custom sort for questions

We are going to use ```pd.Categorical``` to give the questions a custom order, so that sorting by subject and then by question puts Q2 before Q10 (a plain string sort would put Q10 first).

See: http://stackoverflow.com/questions/13838405/custom-sorting-in-pandas-dataframe

Look stuff up online!!

In [6]:
categorical_questions = pd.Categorical(mach_long.variable,
                                       categories=['Q1','Q2','Q3','Q4',
                                                   'Q5','Q6','Q7','Q8',
                                                   'Q9','Q10','Q11','Q12',
                                                   'Q13','Q14','Q15','Q16',
                                                   'Q17','Q18','Q19','Q20'],
                                       ordered=True)

mach_long['variable'] = categorical_questions

mach_long.sort_values(['subject_id','variable'], inplace=True)

mach_long.head()

Unnamed: 0,age,gender,subject_id,variable,value
0,24,1,1,Q1,4
12186,24,1,1,Q2,4
24372,24,1,1,Q3,2
36558,24,1,1,Q4,2
48744,24,1,1,Q5,4
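
To see why the categorical matters, here is a small sketch on made-up data (the `toy` table is invented for illustration). It compares the default alphabetical sort with a sort that follows an explicit category order.

```python
import pandas as pd

# Made-up question labels, deliberately out of order.
toy = pd.DataFrame({'variable': ['Q10', 'Q2', 'Q1'], 'value': [3, 5, 4]})

# As plain strings the sort is alphabetical, so 'Q10' lands before 'Q2'.
alphabetical = toy.sort_values('variable')['variable'].tolist()

# With an ordered categorical, sort_values follows the category order instead.
question_order = ['Q%d' % i for i in range(1, 11)]
toy['variable'] = pd.Categorical(toy['variable'],
                                 categories=question_order, ordered=True)
custom = toy.sort_values('variable')['variable'].tolist()

print(alphabetical)
print(custom)
```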


### A.3: Adding the time per question for each subject_id as a new column

Merge the seconds data and the long data together. This is a preview of future lessons!

In [7]:
mach_secs = mach_long.merge(seconds_data[['subject_id', 'seconds_per_q']], on='subject_id', sort=False)

mach_secs.head()

Unnamed: 0,age,gender,subject_id,variable,value,seconds_per_q
0,24,1,1,Q1,4,8.85
1,24,1,1,Q2,4,8.85
2,24,1,1,Q3,2,8.85
3,24,1,1,Q4,2,8.85
4,24,1,1,Q5,4,8.85
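
Here is a minimal sketch of what that merge is doing, using two small made-up tables (`answers` and `timing` are invented names and values). Each row on the left picks up the matching `seconds_per_q` from the right, so the per-subject value is repeated on every question row.

```python
import pandas as pd

# Made-up long data: two subjects, two questions each.
answers = pd.DataFrame({
    'subject_id': [1, 1, 2, 2],
    'variable': ['Q1', 'Q2', 'Q1', 'Q2'],
    'value': [4, 4, 2, 5],
})

# Made-up per-subject timing: one row per subject.
timing = pd.DataFrame({'subject_id': [1, 2], 'seconds_per_q': [8.85, 6.0]})

# Join on the shared key; the default is an inner join.
merged = answers.merge(timing, on='subject_id')
print(merged)
```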


### A.4: Doing a cumulative sum of the seconds

Do a cumulative sum of the seconds by subject.

This is an example of the split-apply-combine pattern: split the rows by subject, apply a cumulative sum to each group, and combine the results. We will do more as well.

#### A.4.1: Split the data into groups by subject_id

In [None]:
# split into groups; iterating yields (subject_id, sub-DataFrame) pairs
grouped_data = mach_secs.groupby('subject_id')
[i for i in grouped_data]
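
Dumping every group at once gets noisy on real data. A gentler way to look at the split step is to loop over the groups, sketched here on a made-up table (`toy` is invented for illustration).

```python
import pandas as pd

# Made-up data: subject 1 has two rows, subject 2 has one.
toy = pd.DataFrame({'subject_id': [1, 1, 2], 'value': [4, 2, 5]})

grouped = toy.groupby('subject_id')

# Each iteration gives the group key and that group's rows as a DataFrame.
sizes = {}
for subject, group in grouped:
    sizes[subject] = len(group)

# get_group() pulls out a single group by its key.
first_subject = grouped.get_group(1)

print(sizes)
```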

In [24]:
# group by subject_id; the result is a GroupBy object, not a DataFrame
mach_split = mach_secs.groupby(['subject_id'])
mach_split

<pandas.core.groupby.DataFrameGroupBy object at 0x102390c10>

#### A.4.2: Apply the cumulative sum function, iterating through the groups

The ```.transform()``` method on group objects takes a function as its argument, applies it to each group, and returns a result aligned row-for-row with the original data.

In [25]:
mach_split.transform(np.cumsum)

Unnamed: 0,age,gender,seconds_per_q,value
0,24,1,8.85,4
1,48,2,17.70,8
2,72,3,26.55,10
3,96,4,35.40,12
4,120,5,44.25,16
5,144,6,53.10,18
6,168,7,61.95,21
7,192,8,70.80,26
8,216,9,79.65,29
9,240,10,88.50,33
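
Note that the output above also accumulates `age` and `gender`, which is not meaningful. A minimal sketch of restricting the cumulative sum to a single column is shown below on made-up data (the `toy` table and the `cumulative_seconds` column name are invented for illustration).

```python
import pandas as pd

# Made-up data: three rows for subject 1, two for subject 2.
toy = pd.DataFrame({
    'subject_id': [1, 1, 1, 2, 2],
    'seconds_per_q': [8.85, 8.85, 8.85, 6.0, 6.0],
})

# Select one column from the groups, then take the cumulative sum within each group.
toy['cumulative_seconds'] = toy.groupby('subject_id')['seconds_per_q'].cumsum()

# The same thing written with .transform(), which keeps the original row order.
via_transform = toy.groupby('subject_id')['seconds_per_q'].transform('cumsum')

print(toy)
```

The running total restarts at the first row of each subject.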
