Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [2]:
NAME = ""
COLLABORATORS = ""

---

# A quick look at Pandas GroupBy

In [3]:
import numpy as np
import pandas as pd

Let's load a toy DataFrame (example inspired by Wes McKinney's [Python for Data Analysis](http://proquest.safaribooksonline.com.libproxy.berkeley.edu/book/programming/python/9781491957653)):

In [4]:
df = pd.read_csv("elections.csv")
df.head()

|   | Candidate | Party | % | Year | Result |
|---|---|---|---|---|---|
| 0 | Reagan | Republican | 50.7 | 1980 | win |
| 1 | Carter | Democratic | 41.0 | 1980 | loss |
| 2 | Anderson | Independent | 6.6 | 1980 | loss |
| 3 | Reagan | Republican | 58.8 | 1984 | win |
| 4 | Mondale | Democratic | 37.6 | 1984 | loss |


Let's group the `%` column by the `Party` column. A call to [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) does that, but what is the object that results?

In [5]:
percent_grouped_by_party = df['%'].groupby(df['Party'])
percent_grouped_by_party.groups

{'Democratic': Int64Index([1, 4, 6, 7, 10, 13, 15, 17, 19, 21], dtype='int64'),
 'Independent': Int64Index([2, 9, 12], dtype='int64'),
 'Republican': Int64Index([0, 3, 5, 8, 11, 14, 16, 18, 20, 22], dtype='int64')}

Note that `percent_grouped_by_party` is **not** a new DataFrame. It is a `SeriesGroupBy` object, which holds one group for each distinct value of the `Party` column. Its `groups` attribute maps each group label to the index labels of the rows in the original DataFrame that belong to that group.

In [6]:
percent_grouped_by_party.groups

{'Democratic': Int64Index([1, 4, 6, 7, 10, 13, 15, 17, 19, 21], dtype='int64'),
 'Independent': Int64Index([2, 9, 12], dtype='int64'),
 'Republican': Int64Index([0, 3, 5, 8, 11, 14, 16, 18, 20, 22], dtype='int64')}
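To make the mapping concrete, here is a minimal sketch that rebuilds only the five rows shown in `df.head()` (the name `toy` is an assumption for illustration, not part of the assignment) and pulls out a single group with `get_group`:

```python
import pandas as pd

# Hypothetical stand-in for elections.csv: only the five rows shown by df.head().
toy = pd.DataFrame({
    'Candidate': ['Reagan', 'Carter', 'Anderson', 'Reagan', 'Mondale'],
    'Party': ['Republican', 'Democratic', 'Independent', 'Republican', 'Democratic'],
    '%': [50.7, 41.0, 6.6, 58.8, 37.6],
    'Year': [1980, 1980, 1980, 1984, 1984],
    'Result': ['win', 'loss', 'loss', 'win', 'loss'],
})

grouped = toy['%'].groupby(toy['Party'])

# .groups maps each label to the index labels of the rows in that group.
democratic_rows = grouped.groups['Democratic'].tolist()

# get_group returns the actual values (a Series) for a single label.
democratic_values = grouped.get_group('Democratic')

print(democratic_rows)
print(democratic_values)
```

The index labels in `groups` are exactly the index of the Series returned by `get_group`.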

The `percent_grouped_by_party` object is capable of making computations across all these groups. For example, if we call the `mean` function, we'll get the mean of the "Democratic" `Series`, the mean of the "Independent" `Series`, and the mean of the "Republican" `Series`.

In [7]:
percent_grouped_by_party.mean()

Party
Democratic     46.53
Independent    11.30
Republican     47.86
Name: %, dtype: float64
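One way to convince yourself of what `mean` is doing is to compute the per-party means by hand and compare. The sketch below assumes a small hypothetical DataFrame `toy` built from the five rows shown in `df.head()`, so its numbers differ from the full-data output above:

```python
import pandas as pd

# Hypothetical stand-in for elections.csv: only the five rows shown by df.head().
toy = pd.DataFrame({
    'Candidate': ['Reagan', 'Carter', 'Anderson', 'Reagan', 'Mondale'],
    'Party': ['Republican', 'Democratic', 'Independent', 'Republican', 'Democratic'],
    '%': [50.7, 41.0, 6.6, 58.8, 37.6],
    'Year': [1980, 1980, 1980, 1984, 1984],
    'Result': ['win', 'loss', 'loss', 'win', 'loss'],
})

# The groupby way: split by party, take the mean of each piece, combine.
grouped_means = toy['%'].groupby(toy['Party']).mean()

# The manual way: select each party's rows with a boolean mask, then average.
manual_means = {
    party: toy.loc[toy['Party'] == party, '%'].mean()
    for party in toy['Party'].unique()
}

print(grouped_means)
print(manual_means)
```

Both approaches give the same numbers; `groupby` just does the splitting and recombining for us.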

You'll pretty much never do this when working with real data, but we can iterate over a `groupby` object. As we iterate, we get pairs of `(name, group)`, where `name` is the string label for the group and `group` is a `Series` containing all the values from that group.

In [8]:
from IPython.display import display  # like print, but for complex objects

for name, group in percent_grouped_by_party:
    print('Name:', name)
    print(type(group))
    display(group)

We can also group by multiple columns. For example, suppose we want one group for each combination of party (`'Democratic'`, `'Republican'`, `'Independent'`) and election year.

In [9]:
g2 = df['%'].groupby([df['Party'], df['Year']])
g2.groups

{('Democratic', 1980): Int64Index([1], dtype='int64'),
 ('Democratic', 1984): Int64Index([4], dtype='int64'),
 ('Democratic', 1988): Int64Index([6], dtype='int64'),
 ('Democratic', 1992): Int64Index([7], dtype='int64'),
 ('Democratic', 1996): Int64Index([10], dtype='int64'),
 ('Democratic', 2000): Int64Index([13], dtype='int64'),
 ('Democratic', 2004): Int64Index([15], dtype='int64'),
 ('Democratic', 2008): Int64Index([17], dtype='int64'),
 ('Democratic', 2012): Int64Index([19], dtype='int64'),
 ('Democratic', 2016): Int64Index([21], dtype='int64'),
 ('Independent', 1980): Int64Index([2], dtype='int64'),
 ('Independent', 1992): Int64Index([9], dtype='int64'),
 ('Independent', 1996): Int64Index([12], dtype='int64'),
 ('Republican', 1980): Int64Index([0], dtype='int64'),
 ('Republican', 1984): Int64Index([3], dtype='int64'),
 ('Republican', 1988): Int64Index([5], dtype='int64'),
 ('Republican', 1992): Int64Index([8], dtype='int64'),
 ('Republican', 1996): Int64Index([11], dtype='int64'),

Given this groupby object, we can compute the mean percentage for each combination of party and year. Because every group here holds a single row (each party has at most one candidate per election in this dataset), each mean is simply that candidate's percentage.

In [10]:
g2.mean()

Party        Year
Democratic   1980    41.0
             1984    37.6
             1988    45.6
             1992    43.0
             1996    49.2
             2000    48.4
             2004    48.3
             2008    52.9
             2012    51.1
             2016    48.2
Independent  1980     6.6
             1992    18.9
             1996     8.4
Republican   1980    50.7
             1984    58.8
             1988    53.4
             1992    37.4
             1996    40.7
             2000    47.9
             2004    50.7
             2008    45.7
             2012    47.2
             2016    46.1
Name: %, dtype: float64
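The same pattern works with any pair of keys. As a minimal sketch, the code below groups by `Party` and `Result` instead, using a hypothetical DataFrame `toy` that contains only the five rows shown in `df.head()` (so the numbers are not the full-data results):

```python
import pandas as pd

# Hypothetical stand-in for elections.csv: only the five rows shown by df.head().
toy = pd.DataFrame({
    'Candidate': ['Reagan', 'Carter', 'Anderson', 'Reagan', 'Mondale'],
    'Party': ['Republican', 'Democratic', 'Independent', 'Republican', 'Democratic'],
    '%': [50.7, 41.0, 6.6, 58.8, 37.6],
    'Year': [1980, 1980, 1980, 1984, 1984],
    'Result': ['win', 'loss', 'loss', 'win', 'loss'],
})

# One group per (party, result) pair that actually occurs in the data.
by_party_and_result = toy['%'].groupby([toy['Party'], toy['Result']]).mean()
print(by_party_and_result)

# unstack moves the inner index level (Result) into columns,
# leaving NaN for combinations that never occur.
print(by_party_and_result.unstack())
```

The result is a `Series` with a two-level index, just like the `(Party, Year)` output above.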

## DataFrameGroupBy

We can also group an entire DataFrame by one or more columns. This gives us a `DataFrameGroupBy` object:

In [11]:
everything_grouped_by_party = df.groupby('Party')
everything_grouped_by_party

<pandas.core.groupby.DataFrameGroupBy object at 0x7faaf283e908>

As in our previous example, this object contains three `group` objects, one for each party label.

In [12]:
everything_grouped_by_party.groups

{'Democratic': Int64Index([1, 4, 6, 7, 10, 13, 15, 17, 19, 21], dtype='int64'),
 'Independent': Int64Index([2, 9, 12], dtype='int64'),
 'Republican': Int64Index([0, 3, 5, 8, 11, 14, 16, 18, 20, 22], dtype='int64')}

Just as with `SeriesGroupBy` objects, we can iterate over a `DataFrameGroupBy` object to understand what is effectively inside.

In [None]:
for n, g in everything_grouped_by_party:
    print('name:', n)
    display(g)

And just like `SeriesGroupBy` objects, we can apply functions like `mean` to compute the mean of each group. Since a `DataFrameGroupBy` is linked to the entire original DataFrame (instead of to a single column), we get a mean for every numerical column. In the example below, we get the mean vote percentage (as before) and the mean year (which isn't a useful quantity).

In [13]:
everything_grouped_by_party.mean()

| Party | % | Year |
|---|---|---|
| Democratic | 46.53 | 1998.0 |
| Independent | 11.3 | 1989.333333 |
| Republican | 47.86 | 1998.0 |


Where did all the other columns go in the mean above? They are *nuisance columns*: non-numeric columns that the pandas version used here automatically drops from operations where they don't make sense (such as a numerical mean).
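Newer pandas releases (2.0 and later) no longer drop non-numeric columns silently, and calling `mean()` on a group with string columns raises a `TypeError` instead. A minimal sketch of two ways to be explicit about it, assuming a hypothetical DataFrame `toy` built from the five rows shown in `df.head()`:

```python
import pandas as pd

# Hypothetical stand-in for elections.csv: only the five rows shown by df.head().
toy = pd.DataFrame({
    'Candidate': ['Reagan', 'Carter', 'Anderson', 'Reagan', 'Mondale'],
    'Party': ['Republican', 'Democratic', 'Independent', 'Republican', 'Democratic'],
    '%': [50.7, 41.0, 6.6, 58.8, 37.6],
    'Year': [1980, 1980, 1980, 1984, 1984],
    'Result': ['win', 'loss', 'loss', 'win', 'loss'],
})

# Option 1: ask pandas to keep only the numeric columns.
numeric_means = toy.groupby('Party').mean(numeric_only=True)

# Option 2: select the columns we want before aggregating.
selected_means = toy.groupby('Party')[['%', 'Year']].mean()

print(numeric_means)
```

Either form states the intent in the code rather than relying on version-specific behavior.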

Both `SeriesGroupBy` and `DataFrameGroupBy` objects have lots of handy functions for computing aggregate values for groups. A few are demoed below.

In [None]:
everything_grouped_by_party.min()

In [None]:
everything_grouped_by_party.size()
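Several of these built-in aggregations can also be requested at once by passing a list of names to `agg`. A minimal sketch, assuming a hypothetical DataFrame `toy` built from the five rows shown in `df.head()`:

```python
import pandas as pd

# Hypothetical stand-in for elections.csv: only the five rows shown by df.head().
toy = pd.DataFrame({
    'Candidate': ['Reagan', 'Carter', 'Anderson', 'Reagan', 'Mondale'],
    'Party': ['Republican', 'Democratic', 'Independent', 'Republican', 'Democratic'],
    '%': [50.7, 41.0, 6.6, 58.8, 37.6],
    'Year': [1980, 1980, 1980, 1984, 1984],
    'Result': ['win', 'loss', 'loss', 'win', 'loss'],
})

# One row per party, one column per requested aggregation.
summary = toy.groupby('Party')['%'].agg(['min', 'max', 'count'])
print(summary)
```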

We can even define our own custom aggregation functions. For example, the function below returns the average of the first and last items in a series.

In [None]:
def average_of_first_and_last(series):
    return (series.iloc[0] + series.iloc[-1])/2

We can supply this function as a custom aggregation function for each series. As you can see, nuisance columns are automatically removed.

In [None]:
everything_grouped_by_party.agg(average_of_first_and_last)
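How non-numeric columns are handled by a custom aggregation differs between pandas versions, so a more explicit variant is to select the numeric column first. The sketch below assumes a hypothetical DataFrame `toy` built from the five rows shown in `df.head()`:

```python
import pandas as pd

# Hypothetical stand-in for elections.csv: only the five rows shown by df.head().
toy = pd.DataFrame({
    'Candidate': ['Reagan', 'Carter', 'Anderson', 'Reagan', 'Mondale'],
    'Party': ['Republican', 'Democratic', 'Independent', 'Republican', 'Democratic'],
    '%': [50.7, 41.0, 6.6, 58.8, 37.6],
    'Year': [1980, 1980, 1980, 1984, 1984],
    'Result': ['win', 'loss', 'loss', 'win', 'loss'],
})

def average_of_first_and_last(series):
    # Average of the first and last values in the group, by position.
    return (series.iloc[0] + series.iloc[-1]) / 2

# The function is called once per party, on that party's '%' values.
first_last = toy.groupby('Party')['%'].agg(average_of_first_and_last)
print(first_last)
```

For a group with a single row, such as `Independent` here, the first and last items are the same, so the result is just that value.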

## Grouping over a different dimension (extra)

Above, we've been grouping data along the rows, using column keys as our selectors. But we can also group along the columns. For example, we can group by data type:

In [None]:
df.dtypes

In [None]:
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(dtype)
    display(group)
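The `axis=1` argument to `groupby` is deprecated in recent pandas releases (2.1 and later), so the cell above may warn or fail there. A minimal sketch of one alternative, which collects column names by dtype with a plain dictionary, assuming a hypothetical DataFrame `toy` built from the five rows shown in `df.head()`:

```python
import pandas as pd

# Hypothetical stand-in for elections.csv: only the five rows shown by df.head().
toy = pd.DataFrame({
    'Candidate': ['Reagan', 'Carter', 'Anderson', 'Reagan', 'Mondale'],
    'Party': ['Republican', 'Democratic', 'Independent', 'Republican', 'Democratic'],
    '%': [50.7, 41.0, 6.6, 58.8, 37.6],
    'Year': [1980, 1980, 1980, 1984, 1984],
    'Result': ['win', 'loss', 'loss', 'win', 'loss'],
})

# Collect column names under the string name of their dtype.
columns_by_dtype = {}
for column, dtype in toy.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# Show the sub-DataFrame for each dtype.
for dtype, columns in columns_by_dtype.items():
    print(dtype)
    print(toy[columns])
```

This is not a drop-in replacement for a groupby object, but it gives the same per-dtype pieces to look at.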

In [None]:
grouped_by_two=df.groupby([df['Party'], df['Year']])

In [None]:
grouped_by_two.sum()

## Submission

You're done!

Before submitting this assignment, make sure to:

1. Restart the kernel and run all cells (in the menubar, select Kernel->Restart & Run All)
2. Validate the notebook by clicking the "Validate" button

Finally, make sure to **submit** the assignment via the Assignments tab in Datahub.