In [21]:
from IPython.display import HTML
table_css = 'table {align:left;display:block} '
HTML('<style>{}</style>'.format(table_css))

In [None]:
import pandas as pd

# 7 Aggregation

Aggregation is an important topic in data analysis. Pandas offers flexible aggregation functionality.

## 7.1 Simple aggregation using groupby

Let's start by creating a simple dataset.

In [None]:
food_data = {
    "Item": ["Banana", "Cucumber", "Orange", "Tomato", "Watermelon"],
    "Type": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit"],
    "Price": [0.99, 1.25, 0.25, 0.33, 3.00],
}
supermarket = pd.DataFrame(data=food_data)
supermarket

In Pandas, before we can aggregate data, we need to create a `DataFrameGroupBy` object using the `groupby` method of `DataFrame`.

In [None]:
by_type = supermarket.groupby("Type")
by_type
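To make the `DataFrameGroupBy` object less abstract, here is a minimal sketch that iterates over it. It rebuilds the same supermarket data so that it runs on its own; the names `group_sizes`, `type_name`, and `rows` are illustrative only.

```python
import pandas as pd

supermarket = pd.DataFrame({
    "Item": ["Banana", "Cucumber", "Orange", "Tomato", "Watermelon"],
    "Type": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit"],
    "Price": [0.99, 1.25, 0.25, 0.33, 3.00],
})

# Iterating over a groupby object yields (group key, sub-DataFrame) pairs.
group_sizes = {}
for type_name, rows in supermarket.groupby("Type"):
    group_sizes[type_name] = len(rows)
    print(type_name, rows["Item"].tolist())
```

Each `rows` value is an ordinary `DataFrame` holding only the rows of that group.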

You can inspect the groups, and the row labels belonging to each, through the `groups` attribute:

In [None]:
by_type.groups
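As a small self-contained sketch, the mapping stored in `groups` can be converted to plain Python lists to see which row labels fall into each group. The variable name `row_labels` is an illustrative choice.

```python
import pandas as pd

supermarket = pd.DataFrame({
    "Item": ["Banana", "Cucumber", "Orange", "Tomato", "Watermelon"],
    "Type": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit"],
    "Price": [0.99, 1.25, 0.25, 0.33, 3.00],
})
by_type = supermarket.groupby("Type")

# `groups` is an attribute (not a method): a dict-like mapping from each
# group key to the index labels of the rows in that group.
row_labels = {key: list(index) for key, index in by_type.groups.items()}
print(row_labels)
```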

You can get a specific group as follows:

In [None]:
by_type.get_group("Fruit")

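You can calculate the average price of each group as follows. Passing `numeric_only=True` skips the text column `Item`, which recent pandas versions would otherwise refuse to average:

In [None]:
by_type.mean(numeric_only=True)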
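An alternative sketch is to select the column of interest before aggregating, which returns a `Series` indexed by group key. The name `avg_price` is illustrative.

```python
import pandas as pd

supermarket = pd.DataFrame({
    "Item": ["Banana", "Cucumber", "Orange", "Tomato", "Watermelon"],
    "Type": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit"],
    "Price": [0.99, 1.25, 0.25, 0.33, 3.00],
})

# Select the numeric column first, then aggregate it per group.
avg_price = supermarket.groupby("Type")["Price"].mean()
print(avg_price)
```

Selecting the column first keeps the aggregation from touching non-numeric columns at all.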

## 7.2 Common attributes/methods of DataFrameGroupBy

| seq |     name     |                description                                        |
| --- | ------------ | ----------------------------------------------------------------- |
| 01  |  groups      | return a dictionary mapping each group to its row labels          |
| 02  |  first()     | return the first row in each group                                |
| 03  |  last()      | return the last row in each group                                 |
| 04  |  nth()       | return the nth row in each group; the first row is n = 0          |
| 05  |  head()      | return the first few rows in each group                           |
| 06  |  tail()      | return the last few rows in each group                            |
| 07  |  get_group() | return all rows in the specified group                            |
| 08  |  agg()       | apply aggregation functions to different columns using a dict     |
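The row-selection methods in the table can be tried on the small supermarket dataset. This is a minimal sketch that rebuilds the data so it runs on its own; the result variable names are illustrative.

```python
import pandas as pd

supermarket = pd.DataFrame({
    "Item": ["Banana", "Cucumber", "Orange", "Tomato", "Watermelon"],
    "Type": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit"],
    "Price": [0.99, 1.25, 0.25, 0.33, 3.00],
})
by_type = supermarket.groupby("Type")

first_rows = by_type.first()    # first row of each group, indexed by Type
last_rows = by_type.last()      # last row of each group, indexed by Type
second_rows = by_type.nth(1)    # second row of each group (n counts from 0)
head_rows = by_type.head(1)     # first row per group, original row order kept
tail_rows = by_type.tail(1)     # last row per group, original row order kept

print(first_rows)
print(second_rows)
```

Note that `head()` and `tail()` return rows with their original index, while `first()` and `last()` return one row per group indexed by the group key.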



## 7.2 Analyze Fortune 1000 company financial data

### 7.2.1 Load Fortune 1000 company data

In [25]:
fortune = pd.read_csv("fortune1000.csv")
by_sector = fortune.groupby("Sector")
by_sector.head(3).sort_values("Sector")

|     | Company | Revenues | Profits | Employees | Sector | Industry |
| --- | ------- | -------- | ------- | --------- | ------ | -------- |
| 26  | Boeing | 93392.0 | 8197.0 | 140800 | Aerospace & Defense | Aerospace and Defense |
| 50  | United Technologies | 59837.0 | 4552.0 | 204700 | Aerospace & Defense | Aerospace and Defense |
| 58  | Lockheed Martin | 51048.0 | 2002.0 | 100000 | Aerospace & Defense | Aerospace and Defense |
| 331 | PVH | 8915.0 | 537.8 | 28050 | Apparel | Apparel |
| 241 | VF | 12400.0 | 614.9 | 69000 | Apparel | Apparel |
| ... | ... | ... | ... | ... | ... | ... |
| 49  | FedEx | 60319.0 | 2997.0 | 357000 | Transportation | Mail, Package, and Freight Delivery |
| 43  | UPS | 65872.0 | 4910.0 | 346415 | Transportation | Mail, Package, and Freight Delivery |
| 13  | Cardinal Health | 129976.0 | 1288.0 | 40400 | Wholesalers | Wholesalers: Health Care |
| 11  | AmerisourceBergen | 153144.0 | 364.5 | 19500 | Wholesalers | Wholesalers: Health Care |


### 7.2.2 Check number of groups

In [None]:
len(by_sector)

### 7.2.3 Check number of companies in each group

In [None]:
by_sector.size().sort_values(ascending=False)
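Since `fortune1000.csv` is not bundled with this text, the same two checks can be sketched on the small supermarket dataset from section 7.1. The names `n_groups` and `sizes` are illustrative.

```python
import pandas as pd

supermarket = pd.DataFrame({
    "Item": ["Banana", "Cucumber", "Orange", "Tomato", "Watermelon"],
    "Type": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit"],
    "Price": [0.99, 1.25, 0.25, 0.33, 3.00],
})
by_type = supermarket.groupby("Type")

# len() on a groupby object gives the number of distinct groups.
n_groups = len(by_type)

# size() gives the number of rows in each group; sort to see the largest first.
sizes = by_type.size().sort_values(ascending=False)
print(n_groups)
print(sizes)
```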

### 7.2.4 Apply different functions to columns

You can pass a dictionary to `agg()` to apply a different function to each column.
For instance, you can calculate the minimum revenue, maximum profit, and average number of employees for each sector with:

In [29]:
by_sector.agg({"Revenues": "min", "Profits": "max", "Employees": "mean"})

| Sector | Revenues | Profits | Employees |
| ------ | -------- | ------- | --------- |
| Aerospace & Defense | 1877.0 | 8197.0 | 40404.96 |
| Apparel | 2350.0 | 4240.0 | 25407.071429 |
| Business Services | 1851.0 | 6699.0 | 30075.45283 |
| Chemicals | 1925.0 | 3000.4 | 14364.242424 |
| Energy | 1874.0 | 19710.0 | 9170.158879 |
| Engineering & Construction | 1906.0 | 1038.4 | 15583.148148 |
| Financials | 1848.0 | 44940.0 | 22581.412903 |
| Food & Drug Stores | 2064.0 | 4078.0 | 116506.166667 |
| Food, Beverages & Tobacco | 2071.0 | 10999.0 | 29170.702703 |
| Health Care | 1849.0 | 21308.0 | 41847.732394 |
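With the dictionary form, the result columns keep their original names, so the output does not say which function was applied. One possible alternative is pandas' named aggregation, sketched here on the small supermarket dataset because the Fortune 1000 file is not included. The output column names `cheapest`, `priciest`, and `n_items` are illustrative choices.

```python
import pandas as pd

supermarket = pd.DataFrame({
    "Item": ["Banana", "Cucumber", "Orange", "Tomato", "Watermelon"],
    "Type": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit"],
    "Price": [0.99, 1.25, 0.25, 0.33, 3.00],
})

# Named aggregation: each keyword is an output column name, and each value
# is a (source column, aggregation function) pair.
summary = supermarket.groupby("Type").agg(
    cheapest=("Price", "min"),
    priciest=("Price", "max"),
    n_items=("Item", "count"),
)
print(summary)
```

The same pattern should carry over to the sector data, for example with `("Revenues", "min")` as one of the pairs.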
