## Notes On Pandas

## [Selection](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#selection)

- Use `.loc[row_labels, column_labels]` for [label-based indexing](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)
- Use `.iloc[row_positions, column_positions]` for [positional indexing](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)
- Avoid `.ix`, which [mixed label-based and positional indexing](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ix.html): it was deprecated in pandas 0.20 and removed in 1.0, so use `.loc` or `.iloc` instead

Explicit is better than implicit, so when using `.loc` and `.iloc`, always pass both selectors: `[row_indexer, column_indexer]`. Use `:` to select everything along an axis.
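
A minimal sketch of the two indexers side by side. The frame, its labels, and the column names are made up for illustration. Note that label slices include both endpoints while positional slices exclude the end.

```python
import pandas as pd

# Hypothetical toy frame with string row labels.
df = pd.DataFrame(
    {"price": [10, 20, 30], "qty": [1, 2, 3]},
    index=["a", "b", "c"],
)

# Label-based: rows "a" through "b" (both endpoints included), column "price".
by_label = df.loc["a":"b", ["price"]]

# Position-based: first two rows (end position excluded), first column.
by_position = df.iloc[0:2, [0]]

# All rows of one column: spell out the row selector with ":".
all_qty = df.loc[:, "qty"]
```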

## Index

`Index`es are something of a peculiarity of pandas.
First off, they are not the kind of indexes you'll find in SQL, which are used to help the engine speed up certain queries.

In pandas, `Index`es are about labels. This helps with selection (like we did above) and automatic alignment when performing operations between two `DataFrame`s or `Series`.

R does have row labels, but they're nowhere near as powerful (or complicated) as in pandas. You can access the index of a `DataFrame` or `Series` with the `.index` attribute.
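
Automatic alignment is easiest to see with two `Series` whose labels only partly overlap. This is a small sketch with made-up values: the addition matches entries by label, not by position, and labels present on only one side come out as `NaN`.

```python
import pandas as pd

# Two hypothetical Series with overlapping but differently ordered labels.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "b", "d"])

# Values are paired by label: "b" with "b", "c" with "c".
# "a" and "d" exist on one side only, so their result is NaN.
total = s1 + s2

print(total.index)  # the index of the result is the union of both indexes
```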

#### Basic Operations

`df.index`

[`df.set_index`](http://pandas.pydata.org/pandas-docs/stable/indexing.html#set-an-index)

[`df.reset_index`](http://pandas.pydata.org/pandas-docs/stable/indexing.html#reset-the-index)

`df.sort_index`
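
A short sketch of how these operations chain together, using a made-up frame. Each call returns a new object and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data; "city" and "temp" are illustrative column names.
df = pd.DataFrame({"city": ["Oslo", "Bergen", "Alta"], "temp": [3, 7, -2]})

indexed = df.set_index("city")      # the "city" column becomes the row labels
ordered = indexed.sort_index()      # sort rows by label
restored = ordered.reset_index()    # move the labels back into a regular column

print(indexed.index)
```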

#### Boolean indexing

Boolean indexing works like a `WHERE` clause in SQL: only the rows where the mask is `True` are kept. The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.
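
As a sketch with made-up data, a mask is usually built by comparing a column to a value. Conditions are combined with `&` and `|`, and each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical scores table.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [55, 82, 91, 40]})

# A boolean Series with one True/False per row of df.
mask = df["score"] > 60
passed = df[mask]

# Two conditions combined, then a single column selected via .loc.
mid = df.loc[(df["score"] > 50) & (df["score"] < 90), "name"]
```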

## GROUP BY
Groupby is a fundamental operation in pandas and in data analysis generally.

A groupby operation has three steps:

1. Split a table into groups
2. Apply a function to each group
3. Combine the results

In pandas the first step looks like

```python
df.groupby( grouper )
```

`grouper` can be many things

- Series (or string indicating a column in `df`)
- function (to be applied on the index)
- dict: maps index labels to group names, so rows are grouped by the dict's *values*
- `level=`: names (or positions) of levels in a `MultiIndex`
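
The grouper types listed above can be sketched on one small frame. All names and values here are hypothetical, and each variant is followed by `.sum()` only so there is a result to look at.

```python
import pandas as pd

# Hypothetical frame with meaningful row labels.
df = pd.DataFrame(
    {"team": ["x", "y", "x", "y"], "points": [1, 2, 3, 4]},
    index=["a1", "a2", "b1", "b2"],
)

# String naming a column in df.
by_column = df.groupby("team")["points"].sum()

# Function: called on each index label; here it groups by the first character.
by_func = df.groupby(lambda label: label[0])["points"].sum()

# dict: keys are index labels, values are the group names.
mapping = {"a1": "first", "a2": "first", "b1": "second", "b2": "second"}
by_dict = df.groupby(mapping)["points"].sum()

# level: group by a named level of a MultiIndex.
mi = df.rename_axis("id").set_index("team", append=True)
by_level = mi.groupby(level="team")["points"].sum()
```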

After the group by, you can apply one or more aggregation functions. Typical patterns are:

* `df.groupby(grouper).column.<aggregation_function>()`: aggregates a single column with a single function and returns a `Series`
* `df.groupby(grouper).<aggregation_function>()`: applies the same aggregation function to all columns
* `df.groupby(grouper).column.agg(['agg_func_1', 'agg_func_2'...])`: aggregates a single column with several functions and returns one column per function
* `df.groupby(grouper).agg(['agg_func_1', 'agg_func_2'...])`: applies several aggregation functions to every column; the result has `MultiIndex` columns
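
The four patterns can be compared on a toy frame. This is only a sketch: the column names are invented, and `grouped["points"]` is used as the bracket equivalent of the attribute form `grouped.points`.

```python
import pandas as pd

# Hypothetical data with one grouping column and two numeric columns.
df = pd.DataFrame(
    {"team": ["x", "y", "x", "y"], "points": [1, 2, 3, 4], "fouls": [0, 1, 2, 1]}
)
grouped = df.groupby("team")

# One column, one function: a Series indexed by team.
one_col_one_func = grouped["points"].mean()

# All columns, one function: a DataFrame with the original column names.
all_cols_one_func = grouped.mean()

# One column, several functions: a DataFrame with one column per function.
one_col_many = grouped["points"].agg(["mean", "max"])

# All columns, several functions: columns become a two-level MultiIndex.
all_cols_many = grouped.agg(["mean", "max"])

print(all_cols_many.columns.tolist())
```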

## Tidy Data

### The Rules

In a tidy dataset...

1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

We'll cover a few methods that help you get there.

### Stack / Unstack
* `melt` / `stack`: wide to long
* `pivot_table` / `unstack`: long to wide
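
A rough sketch of both round trips, assuming a made-up wide table with one column per year. `melt` and `pivot_table` work on regular columns, while `stack` and `unstack` move labels between the column index and the row index.

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per year.
wide = pd.DataFrame(
    {"city": ["Oslo", "Bergen"], "2019": [5, 8], "2020": [6, 9]}
)

# Wide to long: the year columns collapse into a "year"/"temp" pair.
long_df = wide.melt(id_vars="city", var_name="year", value_name="temp")

# Long to wide: one column per year again (pivot_table aggregates with mean by default).
back = long_df.pivot_table(index="city", columns="year", values="temp")

# stack moves the column labels into an inner row-index level ...
stacked = wide.set_index("city").stack()

# ... and unstack moves that level back out into columns.
unstacked = stacked.unstack()
```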