# The groupby operation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

# Groupby

Groupby operations come up in a lot of contexts.
At its root, a groupby is about performing an operation on many subsets of the data, each of which shares something in common.
A groupby operation has three components:

## Components of a groupby

1. **split** a table into groups
2. **apply** a function to each group
3. **combine** the results into a single DataFrame or Series

In pandas the `split` step looks like

```python
df.groupby( grouper )
```

`grouper` can be many things:

- Series (or string indicating a column in `df`)
- function (to be applied on the index)
- dict : maps index labels to group names, so it groups by the dict's *values*
- `level=[ names of levels in a MultiIndex ]`
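The grouper options listed above can be sketched on a small table. This is a minimal illustration rather than part of the gapminder analysis: the `toy` DataFrame, its index labels, and the `mapping` dict are all made up for the example.

```python
import pandas as pd

# Toy data; the values and index labels are invented for illustration.
toy = pd.DataFrame(
    {
        "continent": ["Asia", "Asia", "Europe", "Europe"],
        "year": [1952, 2007, 1952, 2007],
        "lifeExp": [30.0, 60.0, 65.0, 80.0],
    },
    index=["a1", "a2", "e1", "e2"],
)

# 1. A string naming a column, or the equivalent Series
by_name = toy.groupby("continent")["lifeExp"].mean()
by_series = toy.groupby(toy["continent"])["lifeExp"].mean()

# 2. A function, called on each index label (here: its first character)
by_func = toy.groupby(lambda label: label[0])["lifeExp"].mean()

# 3. A dict mapping index labels to group names; rows are grouped by the values
mapping = {"a1": "early", "e1": "early", "a2": "late", "e2": "late"}
by_dict = toy.groupby(mapping)["lifeExp"].mean()

# 4. The name of a level in a MultiIndex
indexed = toy.set_index(["continent", "year"])
by_level = indexed.groupby(level="continent")["lifeExp"].mean()

print(by_name, by_func, by_dict, by_level, sep="\n\n")
```

The first, second, and fourth variants all produce the same per-continent means here; only the way the groups are specified differs.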

## Split

Break a table into smaller logical tables according to some rule.

In [None]:
df = pd.read_csv('gapminder.tsv', delimiter='\t')
df.head(n=10)

In [None]:
df.groupby('year')

We haven't really done any actual work yet, but pandas knows what it needs to know to break the larger `df` into many smaller pieces, one for each distinct `year`.
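To see what the groupby object already knows about the pieces, you can inspect it before applying anything. The sketch below uses a small made-up `toy` table instead of the gapminder file so that it runs on its own.

```python
import pandas as pd

# Toy stand-in for the gapminder data; the values are invented.
toy = pd.DataFrame(
    {
        "year": [1952, 1952, 2007, 2007],
        "lifeExp": [30.0, 65.0, 60.0, 80.0],
    }
)

grouped = toy.groupby("year")

n_groups = grouped.ngroups             # number of distinct years
sizes = grouped.size()                 # number of rows in each group
first_piece = grouped.get_group(1952)  # the smaller table for one year

# Iterating yields (key, sub-DataFrame) pairs, one per distinct year
for year, piece in grouped:
    print(year, piece.shape)
```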

## Apply & Combine

To finish the groupby, we apply a method to the groupby object.

In [None]:
df.groupby('year')['lifeExp'].agg('mean')

In this case, the function we applied was `'mean'`.
Pandas has implemented cythonized versions of certain common methods like mean, sum, etc.

In terms of split-apply-combine, the split was `df.groupby('year')`.
We applied the `mean` function by passing in `'mean'`.
Finally, by using the `.agg` method (for aggregate) we told pandas to combine the results with one output row per group.
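One way to make the three steps concrete is to do them by hand and compare with `.agg`. This is only a sketch of the idea on a made-up `toy` table; it is not how pandas implements groupby internally.

```python
import pandas as pd

# Toy stand-in for the gapminder data; the values are invented.
toy = pd.DataFrame(
    {
        "year": [1952, 1952, 2007, 2007],
        "lifeExp": [30.0, 65.0, 60.0, 80.0],
    }
)

# split: one sub-DataFrame per distinct year
pieces = {year: piece for year, piece in toy.groupby("year")}

# apply: compute the mean life expectancy of each piece
means = {year: piece["lifeExp"].mean() for year, piece in pieces.items()}

# combine: assemble the per-group results into a single Series
by_hand = pd.Series(means, name="lifeExp")
by_hand.index.name = "year"

# The same result in one expression
via_agg = toy.groupby("year")["lifeExp"].agg("mean")

print(by_hand)
print(via_agg)
```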

You can also pass in regular functions like `np.mean`.

In [None]:
df.groupby('year')['lifeExp'].agg(np.mean).head()

Finally, [certain methods](http://pandas.pydata.org/pandas-docs/stable/api.html#id35) have been attached to `GroupBy` objects, so you can call them directly.

In [None]:
df.groupby('year')['lifeExp'].mean()
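As a small sketch of how the attached methods relate to `.agg`, the example below computes two statistics directly and then both at once by passing a list of names to `.agg`. The `toy` table is made up so the snippet is self-contained.

```python
import pandas as pd

# Toy stand-in for the gapminder data; the values are invented.
toy = pd.DataFrame(
    {
        "year": [1952, 1952, 2007, 2007],
        "lifeExp": [30.0, 65.0, 60.0, 80.0],
    }
)

life = toy.groupby("year")["lifeExp"]

direct_mean = life.mean()  # method attached to the GroupBy object
direct_std = life.std()    # sample standard deviation per group

# Passing a list to .agg gives one column per statistic
summary = life.agg(["mean", "std"])

print(summary)
```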

<div class="alert alert-success" data-title="Highest Variance">
  <h1><i class="fa fa-tasks" aria-hidden="true"></i> Exercise: Highest Variance</h1>
</div>

<p>Find the <code>year</code>s with the greatest variance in <code>population</code>.</p>

- hint: `.std` calculates the standard deviation (`.var` for variance), and is available on `GroupBy` objects.
- hint: use `.sort_values` to sort a Series by the values (it took us a while to come up with that name)

<div class="alert alert-success" data-title="">
  <h1><i class="fa fa-tasks" aria-hidden="true"></i> Exercise: Multi Grouping</h1>
</div>

Group the data using `year` and `continent`, and find the `mean` of life expectancy and GDP.

### Use matplotlib plot

In [None]:
gyle = df.groupby('year')['lifeExp'].mean()
gyle

In [None]:
gyle.plot()

### Saving Files

In [None]:
gyle

In [None]:
type(gyle)

In [None]:
df = gyle.reset_index()
df

In [None]:
df.to_csv('tes_output.csv')

In [None]:
!cat tes_output.csv

In [None]:
df.to_csv('tes_output.csv', index=False)

In [None]:
!cat tes_output.csv