# Grouping Aggregation Basics

In previous chapters, when we called a method such as `sum` on a DataFrame, the operation was applied to all of the rows at once. In this chapter, we will perform operations on distinct groups within our data rather than on the data as a whole.

## Group into independent DataFrames, then aggregate

Take a look at the image below. We will group our original data into independent DataFrames based on the unique values of one or more columns. Below, our groups are based on the value of the 'Dept' column. Once the data is split into these independent DataFrames, the aggregation is performed on each.

![](images/group_aggregate.png)
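
To make the picture concrete, here is a minimal sketch that performs the split and the aggregation by hand and then repeats it with `groupby`. The small `toy` DataFrame and its values are invented for illustration and are not the chapter's employee data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'Dept': ['Fire', 'Police', 'Fire', 'Police', 'Parks'],
    'Salary': [50000, 60000, 70000, 80000, 40000],
})

# Step 1: split into one independent DataFrame per unique 'Dept' value
pieces = {dept: toy[toy['Dept'] == dept] for dept in toy['Dept'].unique()}

# Step 2: aggregate each piece separately
manual = {dept: piece['Salary'].max() for dept, piece in pieces.items()}

# groupby performs both steps in a single expression
grouped = toy.groupby('Dept').agg(max_salary=('Salary', 'max'))
```

Both approaches should give the same maximum per department. The manual version is only meant to show what happens conceptually.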

### Examples of questions we can answer

Grouping data is an extremely common technique used in data analysis and can help us answer a variety of questions. Some examples follow:

* What is the maximum salary for every department at a company?
* What is the average temperature and precipitation for every month for different cities?
* What are the top 5 best selling shirts at each store?

## Grouping with the `groupby` method

The `groupby` method handles most of the tasks in pandas that involve grouping data. This one method is responsible for grouping the data into independent DataFrames as well as performing the aggregation, and usually does so in a single line of code.

### Aggregation

By far, the most common type of function to call on each group is a kind of aggregation, though it's possible to manipulate the data in each group any way you want. This chapter only covers how to perform aggregations on each group.

## Syntax for using the `groupby` method

The `groupby` method is not as straightforward to use as most other methods and will take more effort to learn. Making it even more difficult, there are several valid forms of syntax that do the same thing. Only one version of the syntax will be covered at first, with the others deferred to a separate chapter.

### Must use method chaining with `groupby`

Nearly all of the calls to `groupby` must have another method chained to it to return a result. The `groupby` method is a **two-step process**. First, we inform pandas how we would like to group and then chain the `agg` method to inform pandas how to aggregate. The general syntax takes on the following form.

```python
df.groupby('<grouping column>').agg(<new_column_name>=('<agg column>', '<agg func>'))
```

### First groupby aggregation

Let's begin our usage of the `groupby` method by finding the average salary of the employees in each department.

In [None]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

Using the syntax from above, we produce the following:

In [None]:
emp.groupby('dept').agg(avg_salary=('salary', 'mean')).round(-3)

## Grouping Column, Aggregating Column, Aggregating Function

Every groupby aggregation has three separate components - the grouping column, the aggregating column, and the aggregating function.

* **Grouping column** - Every distinct value in this column forms its own group
* **Aggregating column** - This is the column the function is applied to. The function aggregates it, meaning it returns a single value for each group. This column is usually numeric.
* **Aggregating function** - This is the function that is applied to the aggregating column.

### Identify each piece

When completing a groupby aggregation, it is important to identify each of the pieces. This will help you insert them in the right place of the syntax above. In the above example:

* **Grouping column** - `dept`
* **Aggregating column** - `salary`
* **Aggregating function** - `mean`
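
One way to practice identifying the pieces is to store each one in a variable before writing the call. The sketch below does this on a small made-up DataFrame. The variable names `grouping_col`, `agg_col`, and `agg_func` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'dept': ['Fire', 'Fire', 'Police', 'Police'],
    'salary': [50000, 70000, 60000, 80000],
})

grouping_col = 'dept'   # every distinct value forms its own group
agg_col = 'salary'      # the column the function is applied to
agg_func = 'mean'       # string name of the aggregating function

# Each piece goes into its slot in the general syntax
result = toy.groupby(grouping_col).agg(avg_salary=(agg_col, agg_func))
```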

### Use string names for aggregation functions

Notice that a string was used to identify the aggregating function in the syntax above. pandas recognizes many aggregating functions by their string names. Below are most of the available string names you can use. Later on we will see where these names came from and how to discover them on your own.

* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std`
* `var`
* `count` - count of non-missing values
* `size` - count of all elements
* `first` - first non-missing value in group
* `last` - last non-missing value in group
* `idxmax` - index label of maximum value in group
* `idxmin` - index label of minimum value in group
* `nunique` - number of unique values in group
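
The difference between `count` and `size` only shows up when a group contains missing values. The sketch below uses a small made-up DataFrame with one missing salary to illustrate it, along with `nunique`. The data and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing salary in the 'Fire' group
toy = pd.DataFrame({
    'dept': ['Fire', 'Fire', 'Police', 'Police', 'Police'],
    'salary': [50000, np.nan, 60000, 60000, 80000],
})

summary = toy.groupby('dept').agg(
    n_present=('salary', 'count'),     # non-missing values only
    n_rows=('salary', 'size'),         # every row, missing or not
    n_distinct=('salary', 'nunique'),  # distinct non-missing values
)
```

In this sketch the 'Fire' group has two rows but only one non-missing salary, so `count` and `size` disagree for that group.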

### New column name

With the syntax introduced in this chapter, you must supply a name for the column that holds the aggregated result. This name is written as a keyword argument, so it must be a valid Python identifier and cannot contain spaces.

### Tuple of aggregating column and aggregating function

The parameter used as the new column name must be set equal to a two-item tuple of the aggregating column and the aggregating function.
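
A consequence of this keyword-equals-tuple form is that more than one pair can be passed in a single call to `agg`, each producing its own output column. The following is a minimal sketch on made-up data. The DataFrame and the names `min_salary` and `max_salary` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'dept': ['Fire', 'Fire', 'Police', 'Police'],
    'salary': [50000, 70000, 60000, 80000],
})

# Each keyword is a new column name; each tuple is
# (aggregating column, aggregating function)
result = toy.groupby('dept').agg(
    min_salary=('salary', 'min'),
    max_salary=('salary', 'max'),
)
```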

### More examples

Let's complete a couple more examples with different grouping columns and different aggregating functions. The salary column is the only numeric column, so we will continue to use it as the aggregating column. Let's find the maximum salary by each title. The grouping column is `title`, the aggregating column is `salary`, and the aggregating function is `max`.

In [None]:
emp.groupby('title').agg(max_salary=('salary', 'max')).head().round(-3)

Let's find the sum of all salaries by sex. The grouping column is `sex`, the aggregating column is `salary`, and the aggregating function is `sum`.

In [None]:
emp.groupby('sex').agg(sum_salary=('salary', 'sum'))

## More on method chaining with `groupby`
The `groupby` syntax is a bit strange in that it requires method chaining to deliver results. Let's examine the results of making a call just to the `groupby` method.

In [None]:
emp.groupby('sex')

### What is that?
Every method call in Python returns something, even if that something is `None`. Calling the `groupby` method is no different. It returns a `DataFrameGroupBy` object. Just like all pandas objects, you can see a list of all its [attributes and methods in the API][1]. This type of object is not crucial to dive into at this point. Calling `groupby` by itself does not do much. You are simply telling pandas that you would like to create distinct groups from a particular column.

### Assign the `groupby` object to a variable
Let's assign the result of the call to `groupby` as a variable and verify its type.

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/groupby.html

In [None]:
g = emp.groupby('sex')
type(g)

## `GroupBy` objects
The documentation refers to the object returned from a call to the `groupby` method as a **GroupBy** object. Technically there are two specific objects - `DataFrameGroupBy` (as we saw above) and `SeriesGroupBy`. It's not necessary to think much about these objects. Just be aware that a call to `groupby` returns some other object that is not a DataFrame or a Series. It is a **GroupBy** object with its own attributes and methods.
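
As a rough illustration of the two kinds of object, the sketch below groups a small made-up DataFrame and inspects the type names. Selecting a single column from the grouped object with brackets is one common way to obtain a `SeriesGroupBy`. That selection step is not covered in this chapter and appears here only as an assumption for demonstration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'sex': ['F', 'M', 'F'],
    'salary': [50000, 60000, 70000],
})

df_grouped = toy.groupby('sex')            # grouping a whole DataFrame
s_grouped = toy.groupby('sex')['salary']   # selecting one column afterwards

df_kind = type(df_grouped).__name__
s_kind = type(s_grouped).__name__
```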

### Some exploration of the `GroupBy` object
This GroupBy object can be explored just like any other object in Python. It has attributes and methods that can be accessed through dot notation. Below, we take a look at the `groups` and `ngroups` attributes. The `groups` attribute is a dictionary that maps each group value to the index labels of the rows in that group. For instance, each distinct value of `sex` is a key, and its value holds the labels of the rows belonging to that group.

In [None]:
g.groups

There is also an `ngroups` attribute that returns an integer of the number of distinct groups.

In [None]:
g.ngroups
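
Because the employee data has many rows, the output of `groups` can be hard to read. The same two attributes are easier to see on a small made-up DataFrame, as in the sketch below. The data and the name `g_toy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..4
toy = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M', 'F'],
    'salary': [50000, 60000, 70000, 80000, 90000],
})

g_toy = toy.groupby('sex')

# Convert each group's index labels to a plain list for easy viewing
group_labels = {key: list(labels) for key, labels in g_toy.groups.items()}

n_groups = g_toy.ngroups
```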

### Calling the `agg` method from the GroupBy object
We can call the `agg` method from this GroupBy object to complete an aggregation.

In [None]:
g.agg({'salary':'std'})

### Atypical usage 

Even though it is syntactically correct to assign the result of a call to the `groupby` method to a variable name and then call the `agg` method, it is almost never written like this and should just be completed in a single line. The primary message of this last section was to show you that an intermediate object was created (either a `DataFrameGroupBy` or a `SeriesGroupBy`) and that this object is what is then being used to do the aggregation.
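
The equivalence of the two styles can be checked directly. The sketch below, on made-up data, builds the result both ways and compares them with the DataFrame `equals` method.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M'],
    'salary': [50000, 60000, 70000, 90000],
})

# Two steps: keep the intermediate GroupBy object in a variable
g_toy = toy.groupby('sex')
two_step = g_toy.agg(std_salary=('salary', 'std'))

# One line: chain agg directly onto groupby
one_line = toy.groupby('sex').agg(std_salary=('salary', 'std'))

same = two_step.equals(one_line)
```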

## The index when grouping

If you were paying close attention, you noticed that the grouping column gets placed in the index after a call to the `groupby` method. In the example below, notice that `sex` is the new index and is not a column. Also note that the returned object is a **one-column DataFrame** and NOT a Series.

In [None]:
sex_avg_salary = emp.groupby('sex').agg(avg_salary=('salary', 'mean')).round(-3)
sex_avg_salary

### The index `name`

You might be confused as to why there is the word 'sex' directly above the index. It looks like it is a column name but technically it is not. It is the `name` of the index and can be accessed as an Index attribute.

In [None]:
sex_avg_salary.index.name

### The `reset_index` method
All DataFrames come equipped with a `reset_index` method which turns the index into the first column of a DataFrame. The new index will become a simple `RangeIndex` containing the integers beginning with 0.

In [None]:
emp.groupby('sex').agg(avg_salary=('salary', 'mean')).round(-3).reset_index()
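
The effect of `reset_index` on the index and the columns can be inspected programmatically. The following sketch uses a small made-up DataFrame. The names `by_sex` and `flat` are chosen only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M'],
    'salary': [50000, 60000, 70000, 80000],
})

by_sex = toy.groupby('sex').agg(avg_salary=('salary', 'mean'))

# Before: 'sex' is the name of the index, not a column
index_name = by_sex.index.name
columns_before = list(by_sex.columns)

# After: 'sex' becomes the first column and the index is a RangeIndex
flat = by_sex.reset_index()
columns_after = list(flat.columns)
```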

## Exercises

In [None]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

### Exercise 1

<span  style="color:green; font-size:16px">Find the maximum salary for each sex.</span>

### Exercise 2

<span  style="color:green; font-size:16px">Find the median salary for each department.</span>

Execute the cell below to read in the NYC deaths dataset and use it to answer the following exercises.

In [None]:
deaths = pd.read_csv('../data/nyc_deaths.csv')
deaths.head(3)

### Exercise 3

<span  style="color:green; font-size:16px">Using the employee data in `emp`, find the average salary for each race. Return a DataFrame with the race as a column.</span>

### Exercise 4

<span  style="color:green; font-size:16px">What year had the most deaths?</span>

### Exercise 5

<span  style="color:green; font-size:16px">Find the total number of deaths by race and sort by most to least.</span>