In [None]:
import os

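try:
    # python-dotenv is optional; it loads variables such as DATA_FOLDER from a .env file
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print('python-dotenv is not installed, so check that DATA_FOLDER is set in your environment!')

# fall back to a relative path when DATA_FOLDER is not set
DATA_FOLDER = os.getenv('DATA_FOLDER', '../../data') + '/09_28_2019'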

In [None]:
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt

%matplotlib inline

<hr>

### Read in the data

In [None]:
donations = pd.read_csv(f'{DATA_FOLDER}/processed_donations.csv')
donations.head()

<hr>

## [Pandas GroupBy](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html)



The docs split the GroupBy API into 3 main parts:
1. How to access groups, and iteration
2. Ways to apply functions to groups
3. Available computations / stats that are supported by groupby objects

In [None]:
yearly_groupby = donations.groupby('year')

<hr>

## Accessing groups

### By _Row Label_

In [None]:
yearly_groupby.groups

### By _Row Index_

In [None]:
yearly_groupby.indices

Just a dict! 
Both `.groups` and `.indices` map each group name to that group's rows, which makes it really easy to select specific groups individually.
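
With the default integer index, `.groups` and `.indices` look identical, so the difference is easy to miss. Here is a minimal sketch on a made-up toy frame (the `toy` data and its string labels are assumptions, not the donations data) where row labels and row positions differ:

```python
import pandas as pd

# Toy stand-in for the donations data (values are made up), with string row labels
toy = pd.DataFrame(
    {'year': [2012, 2013, 2012], 'amount': [10.0, 20.0, 5.0]},
    index=['a', 'b', 'c'],
)
toy_groupby = toy.groupby('year')

# .groups maps each group name to the row *labels* in that group
labels_2012 = list(toy_groupby.groups[2012])

# .indices maps each group name to the integer *positions* of those rows
positions_2012 = list(toy_groupby.indices[2012])

print(labels_2012)     # row labels, usable with .loc
print(positions_2012)  # row positions, usable with .iloc
```

Labels go with `.loc` and positions go with `.iloc`; both select the same rows here.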

In [None]:
type(yearly_groupby.groups)

In [None]:
yearly_groupby.groups[2012]

Let's look at the year 2012.
We know that `.loc` selects rows by index label.
So, we should be able to use `.loc` with the labels stored in `.groups` to pull all the rows of the 2012 group out of the original dataframe.

### Accessing group from original dataframe

In [None]:
# query for rows in the 2012 group, and give me all the columns
donations.loc[yearly_groupby.groups[2012], :]
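
The `.loc` lookup above can also be written with `GroupBy.get_group`, which returns one group's rows directly. A small sketch on a made-up toy frame (the `toy` data is an assumption, not the donations data):

```python
import pandas as pd

# Toy stand-in for the donations data (values are made up)
toy = pd.DataFrame({'year': [2012, 2013, 2012], 'amount': [10.0, 20.0, 5.0]})
toy_groupby = toy.groupby('year')

# Select the 2012 rows by label through the .groups dict...
via_loc = toy.loc[toy_groupby.groups[2012], :]

# ...or ask the GroupBy object for the group directly
via_get_group = toy_groupby.get_group(2012)

print(via_get_group)
```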

### Accessing multiple groups..

In [None]:
# concatenate the row labels of both groups into one array
group_2012_2013 = np.concatenate((yearly_groupby.groups[2012], yearly_groupby.groups[2013]), axis=None)

donations.loc[group_2012_2013, :]
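
Two other ways to get at several groups, sketched on a made-up toy frame (the `toy` data is an assumption): `Index.union` merges the label sets without NumPy, and a plain `isin` filter skips the groupby altogether. Note that `union` returns the labels sorted, whereas the concatenation above keeps all of 2012 before 2013.

```python
import pandas as pd

# Toy stand-in for the donations data (values are made up)
toy = pd.DataFrame({
    'year': [2012, 2013, 2014, 2012],
    'amount': [10.0, 20.0, 30.0, 5.0],
})
toy_groupby = toy.groupby('year')

# Option 1: merge the two label sets with Index.union (sorted, de-duplicated)
labels = toy_groupby.groups[2012].union(toy_groupby.groups[2013])
via_union = toy.loc[labels, :]

# Option 2: filter on the column itself, no groupby needed
via_isin = toy[toy['year'].isin([2012, 2013])]

print(via_union)
```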

<hr>

## Iteration
What if we wanted to perform the same operation over all the groups?

```
GroupBy.__iter__::
    Generator yielding sequence of (name, subsetted object)
```

In [None]:
for name, subsetted_object in yearly_groupby:
    print(name)
    print(f'Name type: {type(name)}')

In [None]:
for name, subsetted_object in yearly_groupby:
    print(subsetted_object.shape)
    print(f'subsetted object type: {type(subsetted_object)}')
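
Since each `subsetted_object` is an ordinary dataframe, the loop body can do anything you would do to a dataframe. A minimal sketch that collects one value per group, using a made-up toy frame (the `toy` data and the `largest_per_year` name are assumptions):

```python
import pandas as pd

# Toy stand-in for the donations data (values are made up)
toy = pd.DataFrame({'year': [2012, 2013, 2012], 'amount': [10.0, 20.0, 5.0]})

largest_per_year = {}
for name, group in toy.groupby('year'):
    # group is a regular DataFrame holding only that year's rows
    largest_per_year[int(name)] = float(group['amount'].max())

print(largest_per_year)
```

For something this simple a built-in aggregation is shorter; the loop is mostly useful for per-group work such as plotting or writing files.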

<hr>

## Ok, what about computations?

### Basic [computations / stats](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats)

In [None]:
# notice how all of the columns are some sort of numeric type.. Where is ID?
# non-numeric columns are left out (pandas 2.0+ needs numeric_only=True for that)
yearly_groupby.sum(numeric_only=True)

In [None]:
# on a single column --> Returns a (named) series!
yearly_groupby.amount.sum()

In [None]:
type(yearly_groupby.amount.sum())

In [None]:
# specified columns --> Returns a dataframe!
yearly_groupby[['amount', 'second']].sum()

In [None]:
type(yearly_groupby[['amount', 'second']].sum())
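
The return type depends on how the columns are selected, not on how many there are. A quick sketch on a made-up toy frame (the `toy` data is an assumption) showing that a list with a single column still gives a dataframe:

```python
import pandas as pd

# Toy stand-in for the donations data (values are made up)
toy = pd.DataFrame({
    'year': [2012, 2013, 2012],
    'amount': [10.0, 20.0, 5.0],
    'second': [1, 2, 3],
})
toy_groupby = toy.groupby('year')

as_series = toy_groupby['amount'].sum()    # single label -> Series
as_frame = toy_groupby[['amount']].sum()   # list of labels -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```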

Are there other ways to write this? Yep

### All the ways to do the same thing..

In [None]:
yearly_groupby.amount.apply(np.sum)

In [None]:
yearly_groupby.amount.agg(np.sum)

In [None]:
yearly_groupby.amount.agg(summed_amount=np.sum)

In [None]:
yearly_groupby.amount.aggregate(np.sum)

In [None]:
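# pipe hands the whole GroupBy object to the function, so call .sum() on it inside a lambda
yearly_groupby.amount.pipe(lambda grouped: grouped.sum())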

Great!!
Now we know 5 ways to do the same thing.
What is the use of that??

*Why use one over another*:

* `GroupBy.<function>()` -- Use for a short win! This is great if you already have a column selected, you only want to do a single computation, and you don't care about the name of the result.

* `GroupBy.agg(..)` -- This is my favorite overall. You know what to expect, you can specify names, and you can perform multiple operations all in one go.

* `GroupBy.apply(func)` -- Create your own function to use as an aggregation! It receives each group (a dataframe or series) as its argument and can return a dataframe, a series, or a scalar. Pandas will know how to join the results back together. However, it is usually slower than the built-in aggregations.

* `GroupBy.pipe(..)[.pipe(..)]` -- This passes the whole GroupBy object to your function, which is great when you want to "pipe (or chain)" multiple steps together. Maybe a second aggregation is based off the first, so you need to reuse the same groups.

In my opinion, you can erase the rest from your mind (`aggregate`, for example, is just an alias for `agg`).
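
To make the trade-offs concrete, here is a small side-by-side sketch on a made-up toy frame (the `toy` data and the result names `total`, `largest`, `spread`, and `share` are assumptions chosen for illustration):

```python
import pandas as pd

# Toy stand-in for the donations data (values are made up)
toy = pd.DataFrame({'year': [2012, 2013, 2012], 'amount': [10.0, 20.0, 5.0]})
toy_groupby = toy.groupby('year')

# Direct method: one computation, default name
total = toy_groupby['amount'].sum()

# agg: several computations at once, with names of your choosing
summary = toy_groupby['amount'].agg(total='sum', largest='max')

# apply: a custom function that receives each group's Series
spread = toy_groupby['amount'].apply(lambda s: s.max() - s.min())

# pipe: the function receives the whole GroupBy object, so it can
# combine several aggregations (each year's share of the grand total)
share = toy_groupby['amount'].pipe(lambda g: g.sum() / g.sum().sum())

print(summary)
```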

### More complex examples

In [None]:
donations.head()

In [None]:
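# nested dicts ("nested renamers") were removed in pandas 1.0;
# a list of (name, function) tuples per column gives the same (column, name) MultiIndex
yearly_aggs = yearly_groupby.agg(
    {
        'id': [('nunique', 'nunique')],
        'month': [
            ('most_common', lambda x: x.value_counts().index[0]),
            ('most_uncommon', lambda x: x.value_counts().index[-1]),
        ],
        'day': [
            ('most_common', lambda x: x.value_counts().index[0]),
            ('most_uncommon', lambda x: x.value_counts().index[-1]),
        ],
    }
)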

In [None]:
yearly_aggs
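
If flat column names are all you need, named aggregation with `pd.NamedAgg` (available since pandas 0.25) is an alternative that never creates a MultiIndex. A minimal sketch on a made-up toy frame (the `toy` data, the `most_common` helper, and the output names are assumptions):

```python
import pandas as pd

# Toy stand-in for the donations data (values are made up)
toy = pd.DataFrame({
    'year': [2012, 2012, 2012, 2013],
    'id': ['a', 'b', 'a', 'c'],
    'month': [1, 1, 2, 5],
})

def most_common(values):
    # value_counts sorts by frequency, most frequent first
    return values.value_counts().index[0]

flat_aggs = toy.groupby('year').agg(
    unique_ids=pd.NamedAgg(column='id', aggfunc='nunique'),
    most_common_month=pd.NamedAgg(column='month', aggfunc=most_common),
)

print(flat_aggs)
```

Each keyword becomes one output column, so there is nothing to flatten afterwards.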

Draw histograms of the results..

In [None]:
yearly_aggs.hist(figsize=(20, 10));

<hr>

### Fix the MultiIndex

In [None]:
yearly_aggs.columns

In [None]:
yearly_aggs.columns = [f'{agg}__{col}' for col, agg in yearly_aggs.columns]
yearly_aggs.columns
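
The same flattening step, sketched end to end on a made-up toy frame (the `toy` data is an assumption) so the before and after column labels are easy to compare:

```python
import pandas as pd

# Toy stand-in for the donations data (values are made up)
toy = pd.DataFrame({'year': [2012, 2012, 2013], 'amount': [10.0, 5.0, 20.0]})

# A dict of lists produces MultiIndex columns of (column, aggregation) pairs
toy_aggs = toy.groupby('year').agg({'amount': ['sum', 'max']})
before = list(toy_aggs.columns)

# Join each pair into a single string label
toy_aggs.columns = [f'{agg}__{col}' for col, agg in toy_aggs.columns]
after = list(toy_aggs.columns)

# reset_index moves 'year' from the index back into a regular column
flat = toy_aggs.reset_index()

print(before)
print(after)
```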

Convert to a normal dataframe by moving `year` out of the index and back into a regular column..

In [None]:
yearly_aggs.reset_index()