# Grouping and Aggregating with Multiple Columns

In this chapter, we'll form groups using more than one column, aggregate more than one column, and learn how to apply more than one aggregation function to each group. Let's begin by reading in the San Francisco employee compensation dataset.

In [None]:
import pandas as pd
sf_emp = pd.read_csv('../data/sf_employee_compensation.csv')
sf_emp.head(3)

## Review grouping and aggregating with a single column

In the previous chapter, we had a single grouping column, aggregating column, and aggregating function. The following syntax was used as a guide:

```python
df.groupby('grouping column').agg(new_column=('aggregating column', 'aggregating function'))
```

Let's see this again by calculating the average salary for each organization group. We chain `round(-3)` to round the result to the nearest thousand.

In [None]:
(sf_emp.groupby('organization group')
       .agg(avg_salary=('salaries', 'mean'))
       .round(-3))
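
As a rough illustration of what `round(-3)` does to an aggregated result, here is a minimal sketch. The `toy` DataFrame is a hypothetical stand-in for `sf_emp` with invented values, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for sf_emp; all values are invented.
toy = pd.DataFrame({
    'organization group': ['Culture', 'Culture',
                           'Public Protection', 'Public Protection'],
    'salaries': [40_000, 50_500, 90_250, 100_250],
})

avg = toy.groupby('organization group').agg(avg_salary=('salaries', 'mean'))

# A negative number of decimals rounds to the left of the decimal point,
# so -3 rounds to the nearest thousand.
rounded = avg.round(-3)

print(avg)      # unrounded group means (45250.0 and 95250.0)
print(rounded)  # the same means rounded to the nearest thousand
```

Rounding only changes how the result is displayed here; the underlying group means are computed from the unrounded salaries.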

## Grouping with multiple columns

To create groups based on distinct values from multiple columns, we need to pass a list of these columns to the `groupby` method. Let's find the average salary for every unique combination of year and organization group.

In [None]:
(sf_emp.groupby(['year', 'organization group'])
       .agg(avg_salary=('salaries', 'mean'))
       .round(-3)
       .head(10))

### What happened to our index?

Both year and organization group are no longer columns and have been pushed into the index. This is called a **multi-level index**. The year and organization group are considered **levels** of the index and are NOT columns. Notice that repeated values in the outer level are displayed only once when they immediately follow one another, as with the year level above.
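
One way to confirm that the grouping columns have moved into the index is to inspect the index directly. The sketch below uses a hypothetical `toy` DataFrame with invented values in place of `sf_emp`.

```python
import pandas as pd

# Hypothetical miniature stand-in for sf_emp; all values are invented.
toy = pd.DataFrame({
    'year': [2013, 2013, 2013, 2014, 2014],
    'organization group': ['Culture', 'Culture', 'Public Protection',
                           'Culture', 'Public Protection'],
    'salaries': [40_000, 50_000, 90_000, 45_000, 95_000],
})

grouped = (toy.groupby(['year', 'organization group'])
              .agg(avg_salary=('salaries', 'mean')))

print(type(grouped.index).__name__)  # the class of the index
print(grouped.index.nlevels)         # number of levels in the index
print(list(grouped.index.names))     # names of the levels
print(list(grouped.columns))         # only the aggregated column remains
```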

### The MultiIndex is confusing and not necessary for beginners

In my opinion, the multi-level index does not add much value to pandas and can interfere with learning. I advise those new to pandas to avoid using it until they have mastered the basics. Personally, I rarely use it myself and prefer the levels of the index to be DataFrame columns.

By default, all grouping columns are moved into the index. From this point on, we will chain the `reset_index` method to return these levels to columns. You can achieve the same result by setting the `as_index` parameter to `False` in the `groupby` method.

In [None]:
(sf_emp.groupby(['year', 'organization group'])
       .agg(avg_salary=('salaries', 'mean'))
       .round(-3)
       .reset_index()
       .head(10))
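
To check that `reset_index` and `as_index=False` give the same DataFrame, we can compare the two results. This is a small sketch on a hypothetical `toy` DataFrame with invented values, not on `sf_emp` itself.

```python
import pandas as pd

# Hypothetical miniature stand-in for sf_emp; all values are invented.
toy = pd.DataFrame({
    'year': [2013, 2013, 2013, 2014, 2014],
    'organization group': ['Culture', 'Culture', 'Public Protection',
                           'Culture', 'Public Protection'],
    'salaries': [40_000, 50_000, 90_000, 45_000, 95_000],
})

# Option 1: group, aggregate, then move the index levels back to columns.
with_reset = (toy.groupby(['year', 'organization group'])
                 .agg(avg_salary=('salaries', 'mean'))
                 .reset_index())

# Option 2: ask groupby to keep the grouping columns as columns.
with_flag = (toy.groupby(['year', 'organization group'], as_index=False)
                .agg(avg_salary=('salaries', 'mean')))

print(with_reset.equals(with_flag))
```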

### Isn't the result easier to read with a MultiIndex?

The MultiIndex can make the results easier to read, but it makes further data analysis more difficult as you need to become familiar with special syntax just for the MultiIndex. In my opinion, this added complexity for beginners is not worth the benefit.
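
To give a flavor of the extra syntax involved, the sketch below selects the same value twice: once from a MultiIndex by passing a tuple of level values to `loc`, and once from ordinary columns with a boolean filter. The `toy` DataFrame is hypothetical and its values are invented.

```python
import pandas as pd

# Hypothetical miniature stand-in for sf_emp; all values are invented.
toy = pd.DataFrame({
    'year': [2013, 2013, 2013, 2014, 2014],
    'organization group': ['Culture', 'Culture', 'Public Protection',
                           'Culture', 'Public Protection'],
    'salaries': [40_000, 50_000, 90_000, 45_000, 95_000],
})

grouped = (toy.groupby(['year', 'organization group'])
              .agg(avg_salary=('salaries', 'mean')))

# MultiIndex: a row is addressed by a tuple with one value per level.
from_index = grouped.loc[(2013, 'Culture'), 'avg_salary']

# Ordinary columns: the familiar boolean filtering works unchanged.
flat = grouped.reset_index()
mask = (flat['year'] == 2013) & (flat['organization group'] == 'Culture')
from_columns = flat.loc[mask, 'avg_salary'].iloc[0]

print(from_index, from_columns)
```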

## Aggregating multiple columns

To aggregate multiple columns, pass an additional keyword argument to the `agg` method, set equal to another two-item tuple containing the aggregating column and aggregating function. The name of each keyword argument becomes the name of a column in the result. Here, we find the average salary and overtime for each organization group.

In [None]:
(sf_emp.groupby('organization group')
       .agg(avg_salary=('salaries', 'mean'), avg_overtime=('overtime', 'mean'))
       .round(-3)
       .reset_index())

## Multiple grouping columns, aggregating columns, and aggregating functions

We can combine the last two approaches to simultaneously have multiple grouping and aggregating columns along with multiple aggregating functions. The following finds the mean, min, and max salaries along with the average overtime for every unique combination of year and organization group. 

In [None]:
(sf_emp.groupby(['year', 'organization group'])
       .agg(avg_salary=('salaries', 'mean'),
            min_salary=('salaries', 'min'),
            max_salary=('salaries', 'max'),
            avg_overtime=('overtime', 'mean'))
       .round(-3)
       .reset_index()
       .head(10))
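
Because the real dataset is large, it can help to run the same pattern on a few rows where the answers are easy to verify by hand. The following is a sketch on a hypothetical `toy` DataFrame with invented values.

```python
import pandas as pd

# Hypothetical miniature stand-in for sf_emp; all values are invented.
toy = pd.DataFrame({
    'year': [2013, 2013, 2013, 2014, 2014],
    'organization group': ['Culture', 'Culture', 'Public Protection',
                           'Culture', 'Public Protection'],
    'salaries': [40_000, 50_000, 90_000, 45_000, 95_000],
    'overtime': [0, 1_000, 5_000, 500, 6_000],
})

summary = (toy.groupby(['year', 'organization group'])
              .agg(avg_salary=('salaries', 'mean'),
                   min_salary=('salaries', 'min'),
                   max_salary=('salaries', 'max'),
                   avg_overtime=('overtime', 'mean'))
              .reset_index())

# One row per (year, organization group) pair and one column per keyword.
print(summary)
```

The first row covers the two 2013 Culture employees, so its average salary is 45,000, its minimum and maximum salaries are 40,000 and 50,000, and its average overtime is 500.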

## Getting the size of each group

Let's say we are interested in the number of rows in each group. We can use the `size` aggregating function like this.

In [None]:
sf_emp.groupby('organization group').agg(size_salaries=('salaries', 'size'))

The `size` aggregating function is independent of the aggregating column, so the same value is returned regardless of which column you use. Here we use three different aggregating columns to show that the size of each group is the same.

In [None]:
(sf_emp.groupby('organization group')
       .agg(size_salary=('salaries', 'size'),
            size_overtime=('overtime', 'size'),
            size_retirement=('retirement', 'size'))
       .reset_index()
       .head(10))
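
The reason the aggregating column does not matter is that `size` counts rows, including rows with missing values. The sketch below illustrates this on a hypothetical `toy` DataFrame with invented values, and contrasts it with the `count` aggregating function, which counts only non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for sf_emp; all values are invented.
toy = pd.DataFrame({
    'organization group': ['Culture', 'Culture', 'Culture',
                           'Public Protection', 'Public Protection'],
    'salaries': [40_000, 50_000, 45_000, 90_000, 95_000],
    'overtime': [0.0, np.nan, 1_000.0, np.nan, np.nan],
})

sizes = (toy.groupby('organization group')
            .agg(size_salary=('salaries', 'size'),      # rows per group
                 size_overtime=('overtime', 'size'),    # rows per group again
                 count_overtime=('overtime', 'count'))) # non-missing values only

print(sizes)
```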

### Just use `value_counts`

There is no need to call the `groupby` method with the `size` aggregating function when grouping by a single column. This is exactly what the Series method `value_counts` was designed for. It has the added benefit of sorting the counts in descending order by default.

In [None]:
sf_emp['organization group'].value_counts()
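
As a quick check that the two approaches agree, the sketch below computes group sizes both ways on a hypothetical `toy` DataFrame with invented values. The counts match, and only the ordering differs.

```python
import pandas as pd

# Hypothetical miniature stand-in for sf_emp; all values are invented.
toy = pd.DataFrame({
    'organization group': ['Culture', 'Public Protection', 'Culture',
                           'Public Protection', 'Public Protection'],
    'salaries': [40_000, 90_000, 50_000, 95_000, 85_000],
})

# groupby orders the result by the group labels.
by_groupby = (toy.groupby('organization group')
                 .agg(size=('salaries', 'size'))['size'])

# value_counts orders the result by count, largest first.
by_value_counts = toy['organization group'].value_counts()

print(by_groupby)
print(by_value_counts)
```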

### Multiple group size

It's possible to find the size of groups formed from more than one column by passing a list to the `groupby` method. Again, the choice of aggregating column does not matter, as the size is the same regardless.

In [None]:
(sf_emp.groupby(['year', 'organization group'])
       .agg(size_salary=('salaries', 'size'))
       .head(10))

### DataFrame `value_counts` method

Again, a `value_counts` method produces the same counts. This time it is the DataFrame method, which accepts a list of columns and returns a Series sorted by count.

In [None]:
sf_emp.value_counts(['year', 'organization group']).head()
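
The sketch below compares the DataFrame `value_counts` method with the `groupby` approach on a hypothetical `toy` DataFrame with invented values. It assumes a pandas version that provides `DataFrame.value_counts` (1.1 or later).

```python
import pandas as pd

# Hypothetical miniature stand-in for sf_emp; all values are invented.
toy = pd.DataFrame({
    'year': [2013, 2013, 2013, 2014, 2014],
    'organization group': ['Culture', 'Culture', 'Public Protection',
                           'Culture', 'Public Protection'],
    'salaries': [40_000, 50_000, 90_000, 45_000, 95_000],
})

by_groupby = (toy.groupby(['year', 'organization group'])
                 .agg(size=('salaries', 'size'))['size'])

# Returns a Series indexed by the unique (year, organization group) pairs.
by_value_counts = toy.value_counts(['year', 'organization group'])

print(by_value_counts)
```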

### Rename the column when using `reset_index`

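When calling `reset_index` on a Series, pandas uses the `name` attribute of the Series as the new column name. If the Series has no name, the integer 0 is used instead. This was the case for the Series returned by `value_counts` before pandas 2.0. From pandas 2.0 onward, that Series is named `count`, so the new column is called `count`.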

In [None]:
(sf_emp.value_counts(['year', 'organization group'])
       .reset_index()
       .head(10))

Set the `name` parameter within `reset_index` to set the new column name in the resulting DataFrame.

In [None]:
(sf_emp.value_counts(['year', 'organization group'])
       .reset_index(name='size')
       .head(10))
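
Here is a small sketch of the `name` parameter on a hypothetical `toy` DataFrame with invented values. Whatever name the Series carries, which depends on the pandas version, the `name` argument decides the label of the new column.

```python
import pandas as pd

# Hypothetical miniature stand-in for sf_emp; all values are invented.
toy = pd.DataFrame({
    'year': [2013, 2013, 2013, 2014, 2014],
    'organization group': ['Culture', 'Culture', 'Public Protection',
                           'Culture', 'Public Protection'],
})

counts = toy.value_counts(['year', 'organization group'])
print(counts.name)  # depends on the installed pandas version

# The name argument overrides the Series name for the values column.
named = counts.reset_index(name='size')
print(list(named.columns))
```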

## Exercises

Execute the following cell to read in the City of Houston employee data and use it for the first few exercises.

In [None]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

### Exercise 1

<span  style="color:green; font-size:16px">For each department and sex, find the number of unique position titles, the total number of employees, and the average salary. Make sure there is no multi-level index.</span>

### Exercise 2

<span  style="color:green; font-size:16px">For each department, race, and sex, find the min and max salaries.</span>

Execute the following cell to read in the college dataset and use it for the remaining exercises.

In [None]:
pd.set_option('display.max_columns', 100)
college = pd.read_csv('../data/college.csv')
college.head(3)

### Exercise 3

<span  style="color:green; font-size:16px">Which city name appears the most frequently? Answer this in two different ways: once with and once without the `groupby` method.</span>

### Exercise 4

<span style="color:green; font-size:16px">Does the city 'Houston' only appear in the state of Texas (abbreviated 'TX')?</span>

### Exercise 5

<span style="color:green; font-size:16px">Find the maximum undergraduate population for each state.</span>

### Exercise 6

<span style="color:green; font-size:16px">Find the largest college from each state. From those colleges, find the difference between the largest and smallest.</span>

### Exercise 7

<span style="color:green; font-size:16px">Find the name and population of the largest college per state.</span>

### Exercise 8

<span  style="color:green; font-size:16px">Do distance-only schools tend to have larger or smaller student populations than non-distance-only schools?</span>

### Exercise 9

<span style="color:green; font-size:16px">Are distance-only schools more or less likely to be religiously affiliated than non-distance-only schools?</span>

### Exercise 10

<span  style="color:green; font-size:16px">Among schools that have a religious affiliation, which state has the lowest percentage of currently operating schools?</span>

### Exercise 11

<span  style="color:green; font-size:16px">Find the top 5 historically black colleges that have the highest undergraduate white percentage (ugds_white).</span>