# Tidy data

Hadley Wickham (ggplot, plyr, RStudio) has written an accessible paper called [Tidy Data](https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf). In it he explains how to keep your research data in a tidy format and why it is important to use tidy tools, which help with this task.

The notebooks in this tutorial provide an introduction to one of these tidy tools, the Python pandas library, and especially the dataframe object.

The notebooks focus on the mechanics of using pandas dataframes. As you work through them, try to keep in mind the larger context of tidy data, which can be summarized in four principles (copied from Wikipedia):

1. Each variable you measure should be in one column.
1. Each different observation of that variable should be in a different row.
1. There should be one table for each "kind" of variable.
1. If you have multiple tables, they should include a column in the table that allows them to be linked.
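
As a rough illustration of these principles, the sketch below reshapes a small "wide" table into tidy form and links it to a second table. The table contents and the column names (`dur_i`, `dur_u`, `age`) are made up for this example and do not come from the tutorial data.

```python
import pandas as pd

# Hypothetical "wide" table: one row per speaker, one duration column per vowel.
wide = pd.DataFrame({
    'subject': ['s1', 's2'],
    'dur_i': [0.12, 0.10],
    'dur_u': [0.24, 0.21],
})

# Principles 1 and 2: each variable in its own column, each observation in its own row.
tidy = wide.melt(id_vars='subject', var_name='vowel', value_name='dur')
tidy['vowel'] = tidy['vowel'].str.replace('dur_', '', regex=False)

# Principles 3 and 4: a separate table for speaker information,
# with a shared 'subject' column that allows the tables to be linked.
speakers = pd.DataFrame({'subject': ['s1', 's2'], 'age': [31, 45]})
linked = tidy.merge(speakers, on='subject')
```

After the reshape, `tidy` has one row per speaker and vowel, and the `subject` column is what makes the merge with `speakers` possible.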

Think of pandas as a kind of **data language** that you speak. Your goal is to aggregate diverse data sources into the objects of this language; once you have done that, you can focus on a small number of operations to do your data analysis.
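
A minimal sketch of that idea, assuming two invented sources (an in-memory tab-separated string standing in for a file, and a list of Python records): once both are dataframes, the same operations apply to all of the data.

```python
import io
import pandas as pd

# Source 1: tab-separated text (simulated here with an in-memory string).
tsv = io.StringIO("label\tdur\nt\t0.17\nu\t0.24\n")
from_file = pd.read_csv(tsv, sep='\t')

# Source 2: records already held in Python (made-up values).
from_records = pd.DataFrame([
    {'label': 't', 'dur': 0.19},
    {'label': 'p', 'dur': 0.10},
])

# Combine the sources into one dataframe, then summarize by label.
combined = pd.concat([from_file, from_records], ignore_index=True)
mean_dur = combined.groupby('label')['dur'].mean()
```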

# A teaser

What does the following code do?

In [1]:

```python
import pandas as pd
from audiolabel import read_label

files = ('resource/two_plus_two_1.tg', 'resource/three_plus_five_1.tg')

# Read pyalign phone tiers into a dataframe. Phones are in arpabet transcription.
[phdf] = read_label(files, 'praat', tiers=['phone'], addcols=['barename'])

# Calculate phone durations and assign to 'dur' column.
phdf = phdf.assign(dur=phdf.t2 - phdf.t1)

# Extract and add 'token' and 'subject' columns based on bare filenames.
tokensubj = phdf.barename.str.extract(r'^(?P<token>.+)_(?P<subject>\d+)$', expand=True)
phdf = pd.concat([phdf, tokensubj], axis=1)

# Read mappings of arpabet to ipa symbols. Add 'ipa' column to phone dataframe based
# on arpabet transcription.
ph2ipa = pd.read_csv('resource/arpabet2ipa.txt', sep='\t', names=('arpa','ipa'))
phdf = pd.merge(phdf, ph2ipa, left_on='label', right_on='arpa')
```

In [2]:

```python
# Report on mean duration of phones (in ipa transcription) by speaker.
phdf.groupby(['subject', 'ipa']).dur.mean()
```

```
subject  ipa
1        aɪ     0.359200
         eɪ     0.279300
         f      0.054900
         i      0.119733
         k      0.089800
         l      0.044925
         p      0.099750
         r      0.114750
         s      0.194600
         t      0.172933
         u      0.244450
         v      0.029900
         w      0.039900
         z      0.099750
         ɔ      0.199500
         ə      0.029900
         ʌ      0.064850
         θ      0.109700
Name: dur, dtype: float64
```
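
If you do not have `audiolabel` or the resource files at hand, the same sequence of steps can be tried on a small stand-in dataframe. This is only a sketch: the times, the labels (including the `sp` pause label), and the mapping table below are invented for illustration and are not the contents of the tutorial's files.

```python
import pandas as pd

# Made-up stand-in for the dataframe returned by read_label().
phdf = pd.DataFrame({
    't1': [0.00, 0.10, 0.00, 0.15],
    't2': [0.10, 0.35, 0.15, 0.30],
    'label': ['T', 'UW', 'TH', 'sp'],
    'barename': ['two_plus_two_1', 'two_plus_two_1',
                 'three_plus_five_1', 'three_plus_five_1'],
})

# Duration of each phone.
phdf = phdf.assign(dur=phdf.t2 - phdf.t1)

# Named groups in the regular expression become the column names.
tokensubj = phdf.barename.str.extract(r'^(?P<token>.+)_(?P<subject>\d+)$', expand=True)
phdf = pd.concat([phdf, tokensubj], axis=1)

# Hypothetical mapping table; 'sp' deliberately has no entry.
ph2ipa = pd.DataFrame({'arpa': ['T', 'UW', 'TH'], 'ipa': ['t', 'u', 'θ']})

# pd.merge defaults to an inner join, so rows without a mapping are dropped.
merged = pd.merge(phdf, ph2ipa, left_on='label', right_on='arpa')
result = merged.groupby(['subject', 'ipa']).dur.mean()
```

Two details are worth noticing in this sketch. The `subject` values produced by `str.extract` are strings, not integers, and the default inner join silently removes any row whose label is missing from the mapping table (here, the `sp` row).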