# Pandas aggregation

(when `value_counts` and `describe` won't cut it)


We can group a DataFrame by one or more columns or index levels with `groupby()`. First, we load and merge the inputs:

In [1]:
import numpy as np
import pandas as pd

sales = pd.read_csv('./data/kaggle-sales/sales_train.csv.gz', parse_dates=['date'])
items = pd.read_csv('./data/kaggle-sales/items.csv.gz')
categories = pd.read_csv('./data/kaggle-sales/item_categories.csv.gz')

In [2]:
data = pd.merge(sales, items)
data = pd.merge(data, categories)
data.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name
0,2013-02-01,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
1,2013-01-23,0,24,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
2,2013-01-20,0,27,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
3,2013-02-01,0,25,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
4,2013-03-01,0,25,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
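
The two `pd.merge` calls above pass no `on=` argument, so pandas joins on whatever column names the two frames share. A minimal sketch of that behaviour, using small made-up frames (`toy_sales` and `toy_items` are hypothetical stand-ins, not the Kaggle data):

```python
import pandas as pd

# Hypothetical toy tables; the values are invented for illustration only.
toy_sales = pd.DataFrame({'item_id': [1, 1, 2], 'item_cnt_day': [1.0, 2.0, 1.0]})
toy_items = pd.DataFrame({'item_id': [1, 2], 'item_category_id': [37, 40]})

# With no `on=`, merge joins on the columns both frames share (here: item_id)
# and defaults to an inner join.
implicit = pd.merge(toy_sales, toy_items)

# Spelling out the key and join type gives the same result and documents intent;
# `validate` additionally raises if item_id is not unique in the right-hand table.
explicit = pd.merge(toy_sales, toy_items, on='item_id', how='inner',
                    validate='many_to_one')

print(implicit.equals(explicit))
```

Being explicit can be worth the extra typing when tables might share a column name by accident.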


Let's add a revenue column (item price times the daily item count):

In [3]:
data['revenue'] = data['item_price'] * data['item_cnt_day']
data.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,revenue
0,2013-02-01,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,999.0
1,2013-01-23,0,24,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,999.0
2,2013-01-20,0,27,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,999.0
3,2013-02-01,0,25,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,999.0
4,2013-03-01,0,25,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,999.0


## Question: What was our total revenue per item?

`value_counts` only counts rows, so it won't get us what we want. Instead, we use `groupby`:

In [4]:
g = data.groupby('item_id')
g

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe373a79700>
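
The `DataFrameGroupBy` object printed above holds the grouping but does not show any data by itself. As a rough sketch of how one might inspect it, here on a made-up frame (`toy` and `g_toy` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    'item_id': [1, 1, 2],
    'revenue': [10.0, 20.0, 5.0],
})
g_toy = toy.groupby('item_id')

n_groups = g_toy.ngroups      # number of distinct item_id values
sizes = g_toy.size()          # number of rows in each group
item_1 = g_toy.get_group(1)   # the sub-DataFrame for item_id == 1

print(n_groups)
print(sizes)
print(item_1)
```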

### Iterating through groups

We *can* solve this with a plain Python loop over the groups, but it's ugly and slow:

In [5]:
%%time
item_revenue = pd.Series(0.0, index=np.unique(data.item_id))
for item_id, item_data in g:
    item_revenue[item_id] = item_data.revenue.sum()

CPU times: user 6.24 s, sys: 131 ms, total: 6.38 s
Wall time: 6.33 s


In [6]:
item_revenue.sort_values(ascending=False).head()

6675     2.193915e+08
3732     4.361798e+07
13443    3.433125e+07
3734     3.106516e+07
3733     2.229886e+07
dtype: float64

### Better: use `.apply`

In [7]:
%%time
def total_revenue(group):
    # `group` is the sub-DataFrame for a single item_id
    return group.revenue.sum()

item_revenue = g.apply(total_revenue)

CPU times: user 3.05 s, sys: 133 ms, total: 3.18 s
Wall time: 3.16 s


In [8]:
item_revenue.sort_values(ascending=False).head()

item_id
6675     2.193915e+08
3732     4.361798e+07
13443    3.433125e+07
3734     3.106516e+07
3733     2.229886e+07
dtype: float64

### Best: use group aggregation

In [9]:
%%time
item_revenue = g.revenue.sum()

CPU times: user 13.3 ms, sys: 493 µs, total: 13.8 ms
Wall time: 11.9 ms


In [10]:
item_revenue.sort_values(ascending=False).head()

item_id
6675     2.193915e+08
3732     4.361798e+07
13443    3.433125e+07
3734     3.106516e+07
3733     2.229886e+07
Name: revenue, dtype: float64
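
The three approaches above differ in speed, not in result. A small sketch that checks this on made-up data (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    'item_id': [1, 1, 2, 3, 3],
    'revenue': [10.0, 20.0, 5.0, 1.0, 2.0],
})
g_toy = toy.groupby('item_id')

# 1. Explicit Python loop over (key, sub-DataFrame) pairs
looped = pd.Series({item_id: grp['revenue'].sum() for item_id, grp in g_toy})

# 2. apply with a Python function, called once per group
applied = g_toy['revenue'].apply(lambda s: s.sum())

# 3. Built-in aggregation, which avoids calling Python code per group
built_in = g_toy['revenue'].sum()

print(looped.tolist(), applied.tolist(), built_in.tolist())
```

The built-in version tends to win because the per-group work happens inside pandas rather than in a Python-level function call for every group.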

### Computing multiple aggregates

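We can also compute several aggregates for a single series by passing `agg` a list of function names or callables. Note that `np.max` and `'max'` compute the same thing; the callable shows up as the `amax` column: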

In [13]:
%%time
g.revenue.agg(['min', 'mean', np.median, np.std, np.max, 'max']).head()

CPU times: user 149 ms, sys: 0 ns, total: 149 ms
Wall time: 146 ms


item_id,min,mean,median,std,amax,max
0,58.0,58.0,58.0,,58.0,58.0
1,4490.0,4490.0,4490.0,0.0,4490.0,4490.0
2,58.0,58.0,58.0,0.0,58.0,58.0
3,58.0,79.0,79.0,29.698485,100.0,100.0
4,58.0,58.0,58.0,,58.0,58.0
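
The empty `std` cells for items 0 and 4 are most likely groups with a single row: the default sample standard deviation (`ddof=1`) is undefined for one observation. A minimal sketch on made-up rows chosen to mimic items 0 and 3 (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: item 0 has one row, item 3 has two.
toy = pd.DataFrame({'item_id': [0, 3, 3], 'revenue': [58.0, 58.0, 100.0]})
g_toy = toy.groupby('item_id')['revenue']

sample_std = g_toy.std()            # ddof=1 (default): NaN for a single-row group
population_std = g_toy.std(ddof=0)  # ddof=0: defined (0.0) for a single-row group

print(sample_std)
print(population_std)
```

Which `ddof` is appropriate depends on whether the rows are treated as a sample or as the whole population; the notebook above uses the default.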


You can even get multiple aggregates over multiple columns by passing a dict that maps each column name to a list of aggregates:

In [14]:
%%time
g.agg({
    'revenue': ['min', 'mean', np.std], 
    'item_price': ['min', 'mean', 'max']
}).head()

CPU times: user 107 ms, sys: 1.02 ms, total: 108 ms
Wall time: 106 ms


,revenue,revenue,revenue,item_price,item_price,item_price
item_id,min,mean,std,min,mean,max
0,58.0,58.0,,58.0,58.0,58.0
1,4490.0,4490.0,0.0,4490.0,4490.0,4490.0
2,58.0,58.0,0.0,58.0,58.0,58.0
3,58.0,79.0,29.698485,58.0,79.0,100.0
4,58.0,58.0,,58.0,58.0,58.0
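
The dict form produces two-level column labels, which can be awkward to work with downstream. One alternative is pandas' named aggregation, where each output column is given a flat name. A sketch on made-up data (`toy` and the output column names are hypothetical choices):

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    'item_id': [1, 1, 2],
    'item_price': [100.0, 120.0, 58.0],
    'revenue': [100.0, 240.0, 58.0],
})

# Each keyword is an output column name; the tuple is (input column, aggregate).
summary = toy.groupby('item_id').agg(
    revenue_min=('revenue', 'min'),
    revenue_mean=('revenue', 'mean'),
    price_max=('item_price', 'max'),
)

print(summary)
```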


## Question: What percentage of each shop's revenue does each item category account for?

In [15]:
g = data.groupby(['shop_id', 'item_category_id'])
shop_category_sales = g.revenue.sum().rename('category_sales')
shop_category_sales.head()

shop_id  item_category_id
0        0                       93.0
         1                      283.0
         2                   186567.0
         3                    12584.0
         4                    25606.0
Name: category_sales, dtype: float64

In [16]:
shop_total_sales = data.groupby('shop_id').revenue.sum().rename('total_sales')
shop_total_sales.head()

shop_id
0    6.637370e+06
1    3.238207e+06
2    4.404964e+07
3    3.014085e+07
4    4.053965e+07
Name: total_sales, dtype: float64

In [17]:
shop = pd.merge(shop_category_sales, shop_total_sales, left_index=True, right_index=True)
shop.head()

shop_id,item_category_id,category_sales,total_sales
0,0,93.0,6637370.0
0,1,283.0,6637370.0
0,2,186567.0,6637370.0
0,3,12584.0,6637370.0
0,4,25606.0,6637370.0


In [18]:
shop['share'] = shop.category_sales / shop.total_sales
shop.head()

shop_id,item_category_id,category_sales,total_sales,share
0,0,93.0,6637370.0,1.4e-05
0,1,283.0,6637370.0,4.3e-05
0,2,186567.0,6637370.0,0.028109
0,3,12584.0,6637370.0,0.001896
0,4,25606.0,6637370.0,0.003858


In [19]:
shop[shop.share > 0.2]

shop_id,item_category_id,category_sales,total_sales,share
11,12,116996.67,521655.1,0.22428
13,40,1559649.06,6006173.0,0.259674
20,61,2383770.76,6599924.0,0.361182
20,72,1482895.75,6599924.0,0.224684
34,20,1811011.2,8582822.0,0.211004
36,20,88369.0,377714.0,0.233957
39,20,3979712.97,18092890.0,0.21996
40,72,883075.0,4293587.0,0.205673
55,31,23823432.01,49792060.0,0.478458


# Transform for even more succinct code

Sometimes we might want to compute an aggregate, and then 'broadcast' it to all members of a group.

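`transform` does just that: it applies a function to each group and returns a result with the same index (and therefore the same shape) as the input. If the function reduces a group to a single value, that value is repeated for every row of the group: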

In [20]:
shop_category_sales.head()

shop_id  item_category_id
0        0                       93.0
         1                      283.0
         2                   186567.0
         3                    12584.0
         4                    25606.0
Name: category_sales, dtype: float64

In [21]:
g = shop_category_sales.groupby(level='shop_id')

In [22]:
def group_share(group):
    # Each value divided by its group's total
    return group / group.sum()

In [23]:
shop_share_series = g.transform(group_share).rename('shop_share')
shop_share_series

shop_id  item_category_id
0        0                   0.000014
         1                   0.000043
         2                   0.028109
         3                   0.001896
         4                   0.003858
                               ...   
59       75                  0.027380
         77                  0.000015
         79                  0.006754
         80                  0.000929
         83                  0.001116
Name: shop_share, Length: 3271, dtype: float64
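
To make the shape difference concrete, here is a small sketch contrasting a plain aggregation with `transform` on a made-up series (`toy_sales` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy series with the same two index levels as shop_category_sales.
idx = pd.MultiIndex.from_tuples(
    [(0, 10), (0, 20), (1, 10)], names=['shop_id', 'item_category_id']
)
toy_sales = pd.Series([30.0, 70.0, 50.0], index=idx, name='category_sales')
by_shop = toy_sales.groupby(level='shop_id')

reduced = by_shop.sum()               # one row per shop
broadcast = by_shop.transform('sum')  # one row per original row, total repeated

print(reduced)
print(broadcast)
```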

In [24]:
shop = pd.concat([shop_category_sales, shop_share_series], axis=1)
shop.head()

shop_id,item_category_id,category_sales,shop_share
0,0,93.0,1.4e-05
0,1,283.0,4.3e-05
0,2,186567.0,0.028109
0,3,12584.0,0.001896
0,4,25606.0,0.003858

The same result, written as a single chained expression:


In [25]:
shop = pd.concat([
    shop_category_sales, 
    (
        shop_category_sales
        .groupby(level='shop_id')
        .transform(group_share)
        .rename('shop_share')
    )
], axis=1)
shop.head()

shop_id,item_category_id,category_sales,shop_share
0,0,93.0,1.4e-05
0,1,283.0,4.3e-05
0,2,186567.0,0.028109
0,3,12584.0,0.001896
0,4,25606.0,0.003858
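
Passing a Python function such as `group_share` to `transform` calls it once per group. A possible variation, sketched on made-up data, is to broadcast the built-in `'sum'` and do the division once on the whole series (`toy_sales` and the result names are hypothetical):

```python
import pandas as pd

# Hypothetical toy series; the values are invented for illustration only.
idx = pd.MultiIndex.from_tuples(
    [(0, 10), (0, 20), (1, 10), (1, 20)], names=['shop_id', 'item_category_id']
)
toy_sales = pd.Series([30.0, 70.0, 20.0, 60.0], index=idx, name='category_sales')
by_shop = toy_sales.groupby(level='shop_id')

# Custom function, as in the cells above: called once per shop.
share_custom = by_shop.transform(lambda s: s / s.sum())

# Built-in aggregate broadcast to every row, then a single vectorized division.
share_builtin = toy_sales / by_shop.transform('sum')

print(share_builtin)
```

Both give the same shares; on data with many groups the second form may be noticeably faster, for the same reason the built-in `sum` beat `apply` earlier.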


Open the [Pandas aggregation lab][pandas-aggregation-lab]

[pandas-aggregation-lab]: ./pandas-aggregation-lab.ipynb