In [1]:
from IPython.display import HTML
HTML(open("styles/custom.css", "r").read())

# Pandas & DataFrames
In every domain there are algorithms and data structures that seem to pop up over
and over again. We have already seen the most important data structure for most 
scientific computing: arrays.

However, another equally useful structure is the `DataFrame`. It is similar to a structured array or a table, but with added capabilities that make it invaluable for experimental data. A relative newcomer, the data frame is a must-have data structure for data analysis.

In [None]:
import pandas as pd

## Series
The Series class in pandas is effectively a one-dimensional NumPy array with an
optional associated index. 

Series may be created from array-like objects, and they share the same
primitive dtype system that NumPy arrays use. The following example creates
a series of 64-bit floats:

In [None]:
pd.Series([42, 43, 44], dtype='f8')
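
To make the connection to NumPy concrete, here is a minimal sketch that inspects the dtype, the default index, and the underlying array of the same series. The variable names `floats` and `arr` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

floats = pd.Series([42, 43, 44], dtype='f8')

# 'f8' is NumPy shorthand for a 64-bit (8-byte) float
print(floats.dtype)

# with no index given, pandas supplies a default integer index
print(list(floats.index))

# to_numpy() exposes the values as a plain NumPy array
arr = floats.to_numpy()
print(type(arr), arr.sum())
```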

The following creates a series `s` that uses particle names as the index. Each
value is the number of particles of that type that a detector has seen:

In [None]:
s = pd.Series([42, 43, 44], 
      index=["electron", 
             "proton", 
             "neutron"])

In [None]:
s

In [None]:
s['electron']

In [None]:
# label-based slices include both endpoints
s['electron':'proton']

In [None]:
# positional (integer) slicing still works;
# iloc makes the intent explicit
s.iloc[1:]
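
The difference between the two slicing styles can be spelled out with the `.loc` (label) and `.iloc` (position) accessors. This is a small sketch that redefines `s` so it runs on its own; the names `by_label` and `by_position` are made up for illustration.

```python
import pandas as pd

s = pd.Series([42, 43, 44], index=["electron", "proton", "neutron"])

# label-based slice: both endpoints are included
by_label = s.loc['electron':'proton']

# position-based slice: the stop position is excluded, as in NumPy
by_position = s.iloc[0:1]

print(by_label)
print(by_position)
```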

Series may also be created from dictionaries.

In [None]:
t = pd.Series({'electron': 6, 
              'neutron': 28, 
              'proton': 496, 
              'neutrino': 8128})

In [None]:
t

Arithmetic can be performed on series. The values are matched up by index label, not by position.
Can you explain why neutrino is NaN in the following case?

In [None]:
s + t
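
One way to explore the alignment behaviour is to compare the `+` operator with the `add` method, which accepts a `fill_value` for labels that appear in only one of the two series. The sketch below redefines `s` and `t` so that it is self-contained; `total` and `filled` are illustrative names.

```python
import pandas as pd

s = pd.Series([42, 43, 44], index=["electron", "proton", "neutron"])
t = pd.Series({'electron': 6, 'neutron': 28, 'proton': 496, 'neutrino': 8128})

# operator form: a label present in only one series yields NaN
total = s + t

# method form: the missing entry is treated as 0 before adding
filled = s.add(t, fill_value=0)

print(total)
print(filled)
```

Whether filling with zero is appropriate depends on the analysis; for detector counts, a missing entry might mean "not measured" and not "zero seen".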

## DataFrames

We can create a data frame from a dictionary of arrays, lists, or series. The
keys of the dictionary become the column names. 

1) Reusing the definitions of s and t from before, create a data frame called df.

In [None]:
df = pd.DataFrame({'S': s, 'T': t})

In [None]:
df
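
The structure of the resulting frame can be checked directly. The following sketch rebuilds `s`, `t`, and `df` so it runs on its own, then looks at the column names, the row labels, and the cell for which `s` has no entry.

```python
import pandas as pd

s = pd.Series([42, 43, 44], index=["electron", "proton", "neutron"])
t = pd.Series({'electron': 6, 'neutron': 28, 'proton': 496, 'neutrino': 8128})
df = pd.DataFrame({'S': s, 'T': t})

# the dictionary keys become the column names
print(list(df.columns))

# the row index is the union of the labels of s and t
print(sorted(df.index))

# s has no 'neutrino' entry, so that cell is filled with NaN
print(df.loc['neutrino', 'S'])
```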

2) Select every other row with a slice.

In [None]:
df[::2]

3) Add a new row label to the data frame, with a value in column S.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
dg = pd.concat([df, pd.DataFrame({'S': [-8128]}, index=['antineutrino'])])
dg

4) Delete the neutron row. Note that `drop` returns a new data frame and leaves `df` unchanged.

In [None]:
dh = df.drop('neutron')
dh

You may also easily transpose the rows and columns via the T attribute, as seen here:

In [None]:
df.T

## Masks
Comparison operators can be applied to a whole data frame. The result is a data frame of booleans, called a mask.

In [None]:
df < 42

### Using a Mask
1. Create a mask called `mymask` that is True for even numbers and False for odd numbers.
2. Call `mask()` on `df`, providing `mymask` as an argument.
3. What is the result? Why?
4. Discuss with your neighbor... How might masks be useful in analysis?

In [None]:
mymask = df % 2 == 0
# only the last expression in a cell is displayed, so print the mask explicitly
print(mymask)
df.mask(mymask)
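
As a possible follow-up to the exercise, the sketch below contrasts `mask()` with its complement `where()`, and shows a boolean series used to select rows. It rebuilds `df` so that it is self-contained; `odd_only`, `even_only`, and `small_rows` are illustrative names, not part of the original notebook.

```python
import pandas as pd

s = pd.Series([42, 43, 44], index=["electron", "proton", "neutron"])
t = pd.Series({'electron': 6, 'neutron': 28, 'proton': 496, 'neutrino': 8128})
df = pd.DataFrame({'S': s, 'T': t})
mymask = df % 2 == 0

# mask() replaces entries with NaN where the mask is True (the even numbers)
odd_only = df.mask(mymask)

# where() is the complement: it keeps entries where the mask is True
even_only = df.where(mymask)

# a boolean series used as an indexer selects whole rows
small_rows = df[df['T'] < 100]

print(odd_only)
print(even_only)
print(small_rows)
```

Row selection of this kind is one common use of masks in analysis, for example to apply a cut to the data before computing statistics.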

In [None]:
# accessing a single column 
# will return a series
df['T']

In [None]:
# setting a name to a series
# or expression will add a 
# column to the frame
df['small'] = df['T'] < 100
df

In [None]:
# deleting a column will
# remove it from the frame
del df['small']
df

## Data Structures Wrap-up


You should now be familiar with the following
concepts:
- Series are like NumPy arrays but with more generic indexing capabilities.
- DataFrames are like tables with series as columns.
- DataFrames handle missing data through NaN values.
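
To illustrate the last point, here is a brief sketch of three common ways to work with NaN values, using the same `s` and `t` as before (redefined so the block is self-contained). The names `missing`, `complete`, and `zeroed` are illustrative.

```python
import pandas as pd

s = pd.Series([42, 43, 44], index=["electron", "proton", "neutron"])
t = pd.Series({'electron': 6, 'neutron': 28, 'proton': 496, 'neutrino': 8128})
df = pd.DataFrame({'S': s, 'T': t})

# isna() gives a mask that is True wherever a value is missing
missing = df.isna()

# dropna() removes every row that contains at least one NaN
complete = df.dropna()

# fillna() substitutes a chosen value for each NaN
zeroed = df.fillna(0)

print(missing)
print(complete)
print(zeroed)
```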