# 01. Groupby Aggregation Basics

### Objectives

+ Define split, apply, combine and why it is useful for data analysis
+ Know the definition of an aggregation
+ Group by a single column
+ Aggregate a single column
+ Apply a single function
+ Use this syntax: **`df.groupby('<grouping column>').agg({'<aggregating column>':'<aggregating function>'})`**
+ For every group by aggregation, identify **grouping column**, **aggregating column**, and the **aggregating function**
+ Remove the grouping column from the index with the **`reset_index`** method
+ Know that the `groupby` method returns a **GroupBy** object

### Resources
+ Read the pandas [split apply combine documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html) stopping at 'transformation'.

### Introduction
In previous notebooks, when we called a method such as **`sum`** on a DataFrame, the operation was applied to the DataFrame as a whole. In this notebook, we will instead perform operations on distinct groups within our data. Split-apply-combine is a popular term for this strategy. You can also simply refer to it as **grouping** data.

#### Examples of questions we can answer
The split-apply-combine strategy can be used to answer questions such as:
* What is the maximum salary for every department at a company?
* What is the average temperature and precipitation for every month in different cities?
* What are the top 5 best-selling shirts at each store?

#### Definitions
* **Split** - The data is split into distinct and independent groups based on each member meeting a certain criterion
* **Apply** - Apply a function to each group independently
* **Combine** - Combine the results of the function applied to each group back together to form a single dataset again

![](images/split-apply-combine.png)
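The three steps defined above can be sketched by hand before letting pandas do the work. This is only an illustration on a small made-up DataFrame (the `dept` and `salary` values are hypothetical, not from any dataset used in this notebook), and the manual loop is not how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration
df = pd.DataFrame({
    'dept': ['A', 'B', 'A', 'B', 'A'],
    'salary': [10, 20, 30, 40, 50],
})

# Split: build one sub-DataFrame per distinct value of 'dept'
pieces = {key: df[df['dept'] == key] for key in df['dept'].unique()}

# Apply: summarize each piece independently with a single value
maxima = {key: piece['salary'].max() for key, piece in pieces.items()}

# Combine: put the per-group results back into a single object
combined = pd.Series(maxima, name='salary')

# The same three steps in a single groupby call
direct = df.groupby('dept').agg({'salary': 'max'})
```

Both `combined` and `direct['salary']` hold one maximum salary per department. The rest of this notebook uses the single-call form.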

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 100

## NYC Leading Causes of Death Data
To get started with split-apply-combine, we will use a small dataset containing the leading causes of death in NYC from 2007-2014. [This dataset][1] may be found at the [NYC Open Data][2] site.

[1]: https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam
[2]: https://opendata.cityofnewyork.us/

In [4]:
nyc = pd.read_csv('../data/nyc_deaths.csv')
nyc.head()

Unnamed: 0,year,cause,sex,race,deaths
0,2007,Accidents,F,Asian,32
1,2007,Accidents,F,Black,87
2,2007,Accidents,F,Hispanic,71
3,2007,Accidents,F,White,162
4,2007,Accidents,M,Asian,53


## Grouping with the **`groupby`** method
All of the tasks involving grouping involve the **`groupby`** method. It is responsible for splitting the data into independent groups, applying the desired function or functions to each group and combining the results back together.

### Aggregation
By far the most common type of function to apply to each group is an aggregation function. As we have previously learned, to aggregate means to take all the values in a group and summarize them with a single value. An aggregation always returns a single value for each group. Taking the sum, mean, median, min, max, standard deviation, or count are all examples of aggregations. [See here for more.](https://en.wikipedia.org/wiki/Aggregate_function)
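To make the definition concrete, here is a small sketch on a made-up Series (the values are hypothetical). Each aggregation collapses all five values into one, while `cumsum` is shown as a contrast because it returns one value per input element and is therefore not an aggregation.

```python
import pandas as pd

# Hypothetical values, made up for illustration
values = pd.Series([3, 1, 4, 1, 5])

# Aggregations: many values in, a single value out
total = values.sum()
largest = values.max()
average = values.mean()

# Not an aggregation: the result has the same length as the input
running_total = values.cumsum()
```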


## Syntax for using the `groupby` method
The **`groupby`** method is not as straightforward to use as other pandas methods, and it will take more effort to learn how it works. Unfortunately, there are also several different valid syntaxes that do the same thing.

### Must use method chaining with `groupby`
Nearly all calls to the **`groupby`** method must have another method chained to them to return a result.

### Performing an Aggregation with `agg`
To perform an aggregation, you must chain the **`agg`** method to your call to **`groupby`**. The basic syntax takes the following form:

```
df.groupby('<grouping column>').agg({'<aggregating column>':'<aggregating function>'})
```

The dictionary passed to **`agg`** maps the aggregating column to the aggregating function: `{'<aggregating column>':'<aggregating function>'}`

Let's see an example of this by finding the total number of deaths per year.

In [5]:
nyc.groupby('year').agg({'deaths':'sum'})

year,deaths
2007,53996
2008,54138
2009,52820
2010,52505
2011,52726
2012,52420
2013,53387
2014,53006


## Explanation
Every **`groupby`** aggregation has three separate pieces:
* **grouping column** - Every distinct value in this column forms its own group
* **aggregating column** - This is the column we apply a function to such that it aggregates (returns a single value). This column is usually numeric.
* **aggregating function** - This is the function that is applied to the aggregating column.

## Always identify each piece
When facing a problem where you will be grouping and aggregating, it is important to identify each of the pieces. This will help you insert them in the right place of the syntax above. In the above example, we have:

* **grouping column** - **`year`**
* **aggregating column** - **`deaths`**
* **aggregating function** - **`sum`**
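One way to practice identifying the three pieces is to name them explicitly. The helper below, `group_agg`, is a hypothetical function written only for this illustration (it is not part of pandas), and the toy data is made up. It simply places each piece into the syntax shown earlier.

```python
import pandas as pd

def group_agg(df, grouping_column, aggregating_column, aggregating_function):
    """Hypothetical helper: insert the three pieces into the groupby/agg syntax."""
    return df.groupby(grouping_column).agg({aggregating_column: aggregating_function})

# Hypothetical toy data, made up for illustration
toy = pd.DataFrame({
    'year': [2007, 2007, 2008, 2008, 2008],
    'deaths': [5, 7, 1, 2, 3],
})

# grouping column: 'year', aggregating column: 'deaths', aggregating function: 'sum'
result = group_agg(toy, 'year', 'deaths', 'sum')
```

Here `result` has one row per year, with the total of `deaths` for that year.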

### Use string names for aggregation functions
Pandas understands many string aggregation functions. Below are most of the available string names you can use. 
+ **`sum`**
+ **`min`**
+ **`max`**
+ **`mean`**
+ **`median`**
+ **`std`**
+ **`var`**
+ **`count`** - count of non-missing values
+ **`size`** - count of all elements
+ **`first`** - first value in group
+ **`last`** - last value in group
+ **`idxmax`** - index of maximum value in group
+ **`idxmin`** - index of minimum value in group
+ **`any`** - checks for at least one True value - returns boolean
+ **`all`** - checks that every value is True - returns boolean
+ **`nunique`** - number of unique values in group
+ **`sem`** - standard error of the mean

Later on we will see where these names came from.
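A few of these names are easy to confuse, so here is a short sketch on made-up data (the `team`, `score`, and `passed` columns are hypothetical). It contrasts `count` with `size` when a group contains a missing value, and shows `nunique`, `any`, and `all`.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with one missing score in team 'x'
toy = pd.DataFrame({
    'team': ['x', 'x', 'x', 'y', 'y'],
    'score': [1.0, np.nan, 3.0, 4.0, 4.0],
    'passed': [True, False, True, True, True],
})
g = toy.groupby('team')

counts = g.agg({'score': 'count'})     # non-missing values only
sizes = g.agg({'score': 'size'})       # all rows, missing or not
uniques = g.agg({'score': 'nunique'})  # distinct non-missing values
any_passed = g.agg({'passed': 'any'})  # at least one True in the group
all_passed = g.agg({'passed': 'all'})  # every value in the group is True
```

For team `x`, `count` gives 2 while `size` gives 3, because one score is missing.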

## Find the maximum deaths for each leading cause
Identify each component of the aggregation:

* **grouping column** - **`cause`**
* **aggregating column** - **`deaths`**
* **aggregating function** - **`max`**

Then place each component in the proper place for the syntax:

In [6]:
nyc.groupby('cause').agg({'deaths':'max'}) # the grouping column becomes the index

cause,deaths
Accidents,297
Alzheimer's,276
Cancer,3518
Congenital Malformations,14
Diabetes,410
Flu and Pneumonia,707
HIV,377
Heart Disease,7050
Hepatitis,15
Homicide,299


## Deeper explanation on method chaining with `groupby`
The `groupby` syntax is a bit strange in that it requires method chaining to deliver results. Let's examine the results of making a call just to the **`groupby`** method.

In [7]:
nyc.groupby('year')

<pandas.core.groupby.DataFrameGroupBy object at 0x105e88748>

### What is that?
Calling **`groupby`** by itself does not do much. You are simply alerting pandas that you would like to create distinct groups with a particular column. It has formally returned a **`DataFrameGroupBy`** object. As with all pandas objects, you can see a list of all its [attributes and methods in the API][1].

### Assign the `groupby` object to a variable
Let's assign the result of the call to **`groupby`** to a variable and verify its type.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#groupby

In [8]:
g = nyc.groupby('year')

In [9]:
type(g)

pandas.core.groupby.DataFrameGroupBy

## `GroupBy` objects
The documentation refers to the object returned from a call to the **`groupby`** method as a **GroupBy** object. Technically there are two specific objects - **`DataFrameGroupBy`** (as we saw above) and **`SeriesGroupBy`**. It's not necessary to think much about these objects. Just be aware that a call to **`groupby`** returns some other object that is not a DataFrame or a Series. It is a **GroupBy** object with its own attributes and methods.
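As a quick check of the two object types, the sketch below uses a made-up DataFrame (hypothetical values). It assumes that selecting a single column from a `DataFrameGroupBy` object with brackets yields a `SeriesGroupBy` object, and it inspects the class names rather than importing the classes, since their import path has changed between pandas versions.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration
toy = pd.DataFrame({
    'year': [2007, 2007, 2008],
    'deaths': [5, 7, 1],
})

frame_groups = toy.groupby('year')             # grouping a whole DataFrame
series_groups = toy.groupby('year')['deaths']  # selecting a single column afterwards

frame_type = type(frame_groups).__name__
series_type = type(series_groups).__name__
print(frame_type, series_type)
```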

## The `groups` attribute
One interesting attribute of the GroupBy object is **`groups`**. It is a dictionary that maps each distinct group value (the key) to the index labels of the rows belonging to that group.

In [10]:
g.groups

{2007: Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
             ...
              96,  97,  98,  99, 100, 101, 102, 103, 104, 105],
            dtype='int64', length=106),
 2008: Int64Index([106, 107, 108, 109, 110, 111, 112, 113, 114, 115,
             ...
             205, 206, 207, 208, 209, 210, 211, 212, 213, 214],
            dtype='int64', length=109),
 2009: Int64Index([215, 216, 217, 218, 219, 220, 221, 222, 223, 224,
             ...
             311, 312, 313, 314, 315, 316, 317, 318, 319, 320],
            dtype='int64', length=106),
 2010: Int64Index([321, 322, 323, 324, 325, 326, 327, 328, 329, 330,
             ...
             417, 418, 419, 420, 421, 422, 423, 424, 425, 426],
            dtype='int64', length=106),
 2011: Int64Index([427, 428, 429, 430, 431, 432, 433, 434, 435, 436,
             ...
             524, 525, 526, 527, 528, 529, 530, 531, 532, 533],
            dtype='int64', length=107),
 2012: Int64Index([534, 535, 536, 537, 538, 539, 5
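The same mapping is easier to read on a tiny example. The sketch below uses made-up data (hypothetical values) in which the rows for each year are interleaved, so the index labels for a group are not contiguous.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration
toy = pd.DataFrame({
    'year': [2007, 2008, 2007, 2008, 2007],
    'deaths': [5, 1, 7, 2, 3],
})
g = toy.groupby('year')

# Each key is a distinct year; each value holds the index labels of its rows
labels_2007 = list(g.groups[2007])
labels_2008 = list(g.groups[2008])

# The labels can be passed to .loc to pull out the rows of one group
rows_2007 = toy.loc[g.groups[2007]]
```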

## Calling the `agg` method from the GroupBy object
We can call the **`agg`** method from this assigned variable (the GroupBy object) to get the same result as above.

In [11]:
g.agg({'deaths':'sum'})

year,deaths
2007,53996
2008,54138
2009,52820
2010,52505
2011,52726
2012,52420
2013,53387
2014,53006


## Understanding the index
If you were paying close attention, you may have noticed that the grouping column gets placed in the index. In our example above, **`year`** is now the index. It is not a column.

In [12]:
year_deaths = nyc.groupby('year').agg({'deaths':'sum'})
year_deaths

year,deaths
2007,53996
2008,54138
2009,52820
2010,52505
2011,52726
2012,52420
2013,53387
2014,53006


### Use the `reset_index` method to turn the index into a column
All DataFrames come equipped with a **`reset_index`** method which makes the index into a column of data. You can chain it after the call to **`agg`**.

In [13]:
nyc.groupby('year').agg({'deaths':'sum'}).reset_index()

,year,deaths
0,2007,53996
1,2008,54138
2,2009,52820
3,2010,52505
4,2011,52726
5,2012,52420
6,2013,53387
7,2014,53006
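To confirm where the grouping column lives before and after `reset_index`, the sketch below inspects the index name and the column labels on a made-up DataFrame (hypothetical values).

```python
import pandas as pd

# Hypothetical toy data, made up for illustration
toy = pd.DataFrame({
    'year': [2007, 2007, 2008],
    'deaths': [5, 7, 1],
})

by_year = toy.groupby('year').agg({'deaths': 'sum'})
flat = by_year.reset_index()

index_name_before = by_year.index.name  # the grouping column is the index
columns_before = list(by_year.columns)  # only the aggregating column
columns_after = list(flat.columns)      # the grouping column is now a regular column
```

After `reset_index`, the result has a default integer index and `year` appears alongside `deaths` as an ordinary column.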


# Exercises

### Problem 1
<span  style="color:green; font-size:16px">What year had the most deaths?</span>

In [39]:
nyc.groupby('year').agg({'deaths':'sum'}).sort_values(by = 'deaths', ascending=False).head(1)

year,deaths
2008,54138


In [40]:
nyc.groupby('year').agg({'deaths':'sum'}).sort_values(by = 'deaths', ascending=False).head(1).index.values

array([2008])

### Problem 2
<span  style="color:green; font-size:16px">Find the total number of deaths by race and sort by most to least.</span>

In [24]:
nyc.groupby('race').agg({'deaths':'sum'}).sort_values(by = 'deaths', ascending = False)

race,deaths
White,206487
Black,111116
Hispanic,74802
Asian,26355
Unknown,6238


### Use the employee dataset for the remaining problems

In [25]:
emp = pd.read_csv('../data/employee.csv')
emp.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date,job_date
0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic,Female,2006-06-12,2012-10-13
1,LIBRARY ASSISTANT,Library,26125.0,Hispanic,Female,2000-07-19,2010-09-18
2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03,2015-02-03
3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08,1991-05-25
4,ELECTRICIAN,General Services Department,56347.0,White,Male,1989-06-19,1994-10-22


### Problem 3
<span  style="color:green; font-size:16px">Find the maximum salary for each gender.</span>

In [28]:
emp.groupby('gender').agg({'salary':'max'})

gender,salary
Female,178331.0
Male,275000.0


### Problem 4
<span  style="color:green; font-size:16px">Find the median salary for each department.</span>

In [30]:
emp.groupby('dept').agg({'salary':'median'})

dept,salary
Admn. & Regulatory Affairs,37710.0
City Controller's Office,57054.0
City Council,54000.0
Convention and Entertainment,38397.0
Dept of Neighborhoods (DON),43742.0
Finance,80542.0
Fleet Management Department,44158.0
General Services Department,42473.5
Health & Human Services,46717.0
Housing and Community Devp.,57284.5


### Problem 5
<span  style="color:green; font-size:16px">Find the average salary for each race. Return a DataFrame with the race as a column.</span>

In [37]:
emp.groupby('race').agg({'salary':'mean'})

race,salary
Asian,61660.304762
Black,50137.801493
Hispanic,52345.562771
Native American,60272.1
Other,51278.0
White,64419.799012


In [36]:
emp.groupby('race').agg({'salary':'mean'}).reset_index()

Unnamed: 0,race,salary
0,Asian,61660.304762
1,Black,50137.801493
2,Hispanic,52345.562771
3,Native American,60272.1
4,Other,51278.0
5,White,64419.799012
