# Using the Pandas Python Data Toolkit

Today we will highlight some of the most useful features of the Pandas library in Python while exploring nematode worm behaviour data collected with the Multi-Worm Tracker (Swierczek et al., 2011).

Specifically, we will explore:

1. Loading data
2. Dataframe data structures
3. Element-wise mathematics
4. Working with time series data
5. Quick and easy visualization

## Some initial setup

In [None]:
## load libraries
%matplotlib inline
import pandas as pd
import numpy as np

from pandas import set_option
set_option("display.max_rows", 4)

## magic to time cells in ipython notebook
## (%install_ext has been removed from IPython; install the extension with pip instead)
!pip install ipython-autotime
%load_ext autotime

## 1. Loading data from a local text file

For more details, see http://pandas.pydata.org/pandas-docs/stable/io.html

Let's first load some behaviour data from a collection of wild-type worms.

In [None]:
filename = 'data/behav.dat'
## raw string so that \s is passed to the regex engine unchanged
behav = pd.read_table(filename, sep=r'\s+')
behav
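
As a minimal sketch of what the whitespace separator does, the example below parses a small in-memory table instead of `data/behav.dat`. The column names are borrowed from later in this lesson, but the values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/behav.dat: whitespace-delimited text
# with a header row. The numbers are invented for this example.
raw_text = """time   speed   midline   morphwidth
0.05   0.12    1.02      0.10
0.10   0.15    1.01      0.11
0.15   0.11    0.99      0.10
"""

# sep=r'\s+' treats any run of spaces or tabs as a single delimiter,
# so uneven column padding does not produce empty columns.
example = pd.read_table(io.StringIO(raw_text), sep=r'\s+')

print(example.shape)
print(list(example.columns))
```

Passing a file path instead of the `io.StringIO` object reads from disk in the same way.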

## 2. Dataframe data structures

For more details, see http://pandas.pydata.org/pandas-docs/stable/dsintro.html

Pandas provides the `DataFrame` data structure. These tabular objects let you mix and match columns of different data types in one "table".

In [None]:
print(behav.dtypes)
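
To make the "mix and match" idea concrete, here is a small sketch that builds a dataframe by hand. The column names `plate` and `number` are hypothetical and are not claimed to exist in the worm data.

```python
import pandas as pd

# Hypothetical columns, chosen only to show three different data types.
mixed = pd.DataFrame({
    'plate': ['a', 'a', 'b'],        # text
    'number': [10, 12, 9],           # integers
    'speed': [0.12, 0.15, 0.11],     # floats
})

# Each column keeps its own dtype inside the same table.
print(mixed.dtypes)
```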

## 3. Element-wise mathematics

Suppose we want to add a new column that is a combination of two columns in our dataset. Similar to `numpy`, `Pandas` lets us do this easily by performing math between columns on an element-by-element basis. For example, we are interested in the ratio of the morphwidth to the midline length to look at whether worms are crawling in a straight line or curling back on themselves (*e.g.,* during a turn).

In [None]:
## vectorization takes 49.3 ms
behav['mid_width_ratio'] = behav['morphwidth']/behav['midline']
behav[['morphwidth', 'midline', 'mid_width_ratio']].head()

In [None]:
## looping takes 1 min 44s
mid_width_ratio = np.empty(len(behav), dtype='float64')

## start at row 0 so that every element of the np.empty array gets filled
for i in range(len(behav)):
    mid_width_ratio[i] = behav.loc[i, 'morphwidth'] / behav.loc[i, 'midline']

behav['mid_width_ratio'] = mid_width_ratio
behav[['morphwidth', 'midline', 'mid_width_ratio']].head()
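
The two cells above should give the same numbers; only the speed differs. The sketch below checks this on a tiny made-up table (the values are invented, and the default integer index is assumed so that `.loc[i, ...]` addresses row `i`).

```python
import numpy as np
import pandas as pd

# Hypothetical measurements standing in for the real tracker output.
toy = pd.DataFrame({
    'morphwidth': [0.10, 0.11, 0.10, 0.12],
    'midline': [1.02, 1.01, 0.99, 0.50],
})

# Vectorised: one expression divides the two columns row by row.
vectorised = toy['morphwidth'] / toy['midline']

# Explicit loop: the same arithmetic, one row at a time.
looped = np.empty(len(toy), dtype='float64')
for i in range(len(toy)):
    looped[i] = toy.loc[i, 'morphwidth'] / toy.loc[i, 'midline']

# Both approaches agree element by element.
print(np.allclose(vectorised.values, looped))
```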

### `apply()`
For more details, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

Another bonus of using Pandas is the `apply` function. It lets you apply any function to selected columns or rows of a dataframe, or across the entire dataframe.

In [None]:
## custom function to center data
def center(data):
    return data - data.mean()

In [None]:
## center all data on a column basis
behav.iloc[:,4:].apply(center).head()
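
As a small sketch of how `apply` hands data to `center`, the example below runs it column-wise and row-wise on a made-up two-column table. The `axis` argument controls the direction; the data are invented for illustration.

```python
import numpy as np
import pandas as pd

def center(data):
    """Subtract the mean so that the values are centred on zero."""
    return data - data.mean()

# Hypothetical values; only the shape of the example matters.
toy = pd.DataFrame({
    'speed': [0.10, 0.20, 0.30],
    'bias': [1.0, -1.0, 0.0],
})

# Default (axis=0): center() receives each column as a Series.
by_column = toy.apply(center)

# axis=1: center() receives each row as a Series instead.
by_row = toy.apply(center, axis=1)

print(by_column)
print(by_row)
```

After centring by column, every column mean is zero (up to floating point error); after centring by row, every row mean is.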

## 4. Working with time series data

### Indices
For more details, see http://pandas.pydata.org/pandas-docs/stable/indexing.html

Because this is time series data, we want to use time as the index. We can set this while we read in the data.

In [None]:
behav = pd.read_table(filename, sep=r'\s+', index_col='time')
behav
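
The sketch below, using a small invented table, suggests that setting the index at read time with `index_col` gives the same result as calling `set_index` afterwards, and shows a label-based lookup by time.

```python
import io

import pandas as pd

# Hypothetical whitespace-delimited data with a 'time' column in seconds.
raw_text = """time  speed  bias
0.5   0.12   1
1.0   0.15   -1
1.5   0.11   1
"""

# Option 1: choose the index while reading.
at_read = pd.read_table(io.StringIO(raw_text), sep=r'\s+', index_col='time')

# Option 2: read first, then move the column into the index.
after_read = pd.read_table(io.StringIO(raw_text), sep=r'\s+').set_index('time')

print(at_read.index.name)
print(at_read.equals(after_read))

# With time as the index, rows can be looked up by their time label.
print(at_read.loc[1.0, 'speed'])
```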

To use the functions built into Pandas for time series data, let's convert our time index to datetime objects using the `to_datetime()` function.

In [None]:
behav.index.dtype

In [None]:
behav.index = pd.to_datetime(behav.index, unit='s')
print(behav.index.dtype)
behav
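
A short sketch of what `unit='s'` does, on invented values: the numbers are read as seconds counted from the Unix epoch (1970-01-01), so the calendar dates are only placeholders and the spacing between time points is what carries meaning.

```python
import pandas as pd

# Hypothetical elapsed times in seconds since the start of a recording.
seconds = pd.Index([0.0, 0.5, 10.0, 75.0], name='time')

# unit='s' interprets each number as seconds since 1970-01-01 00:00:00.
stamps = pd.to_datetime(seconds, unit='s')

print(stamps.dtype)
print(stamps[0])               # the epoch itself
print(stamps[-1] - stamps[0])  # elapsed time is preserved as a Timedelta
```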

Now that our index holds datetime objects, we can use the `resample` function to group the data into time intervals. With this function you can choose the time interval as well as how to downsample (mean, sum, *etc.*).

In [None]:
## the how= argument has been removed from pandas; call the aggregation as a method
behav_resampled = behav.resample('10s').mean()
behav_resampled
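
To see how the 10 second bins are formed, here is a minimal sketch on six invented observations. It uses the method-chaining form `resample(...).mean()`, and the same bins can be aggregated in other ways, such as `count()`.

```python
import pandas as pd

# Hypothetical observation times (seconds) and speeds.
seconds = [0, 2, 4, 11, 13, 25]
toy = pd.DataFrame(
    {'speed': [1.0, 2.0, 3.0, 4.0, 6.0, 10.0]},
    index=pd.to_datetime(seconds, unit='s'),
)

# Bins are [0 s, 10 s), [10 s, 20 s) and [20 s, 30 s).
bins = toy.resample('10s')

means = bins.mean()    # speed: 2.0, 5.0, 10.0
counts = bins.count()  # speed: 3, 2, 1

print(means)
print(counts)
```

Each output row is labelled with the start of its bin.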

## 5. Quick and easy visualization
For more details, see: http://pandas.pydata.org/pandas-docs/version/0.15.0/visualization.html

In [None]:
behav_resampled['angular_speed'].plot()

In [None]:
behav_resampled.plot(subplots=True, figsize = (10, 12))

In [None]:
behav_resampled[['speed', 'angular_speed', 'bias']].plot(subplots = True, figsize = (10,8))
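
The call to `plot(subplots=True)` returns one matplotlib axes object per column, which can be customised afterwards. The sketch below uses random numbers in place of the worm data, and the `Agg` backend line is only there so that it also runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook

import numpy as np
import pandas as pd

# Random stand-in data: 30 time points, 10 s apart, three columns.
rng = np.random.RandomState(0)
index = pd.to_datetime(np.arange(0, 300, 10), unit='s')
toy = pd.DataFrame({
    'speed': rng.rand(30),
    'angular_speed': rng.rand(30),
    'bias': rng.rand(30),
}, index=index)

# One panel per column; the return value is an array of matplotlib axes.
axes = toy.plot(subplots=True, figsize=(10, 8))

# Label each panel with the name of the column it shows.
for ax, column in zip(axes, toy.columns):
    ax.set_ylabel(column)
```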

## Summary

Pandas is an extremely useful and efficient tool for scientists, or anyone who needs to wrangle, analyze and visualize data!

#### Pandas is particularly attractive to scientists with minimal programming experience because:
* It has a strong, welcoming and growing community
* Its code is readable
* Its idioms match intuition

To learn more about Pandas see:
* [Pandas Documentation](http://pandas.pydata.org/)
* IPython notebook [tutorial](http://nsoontie.github.io/2015-03-05-ubc/novice/python/Pandas-Lesson.html) by Nancy Soontiens (Software Carpentry)
* Video [tutorial](https://www.youtube.com/watch?v=0CFFTJUZ2dc&list=PLYx7XA2nY5Gcpabmu61kKcToLz0FapmHu&index=12) from SciPy 2015 by Jonathan Rocher
* [History of Pandas](https://www.youtube.com/watch?v=kHdkFyGCxiY) by Wes McKinney