In [1]:
import pandas as pd
import openpyxl
from pathlib import Path
input_file = Path.cwd() / 'data' / 'raw' / 'sample_sales.xlsx'
df = pd.read_excel(input_file, engine='openpyxl')
df.head()

,invoice,company,purchase_date,product,quantity,price,extended amount
0,ZN-870-29,Realcube,2019-03-05,shirt,19,17,323
1,JQ-501-63,Zooxo,2019-07-09,book,30,14,420
2,FI-165-58,Dabtype,2019-08-12,poster,7,23,161
3,XP-005-55,Skipfire,2019-11-18,pen,7,29,203
4,NB-917-18,Bluezoom,2019-04-18,poster,36,19,684


## Aggregation

Simple aggregation

In [2]:
df['price'].agg(['mean'])

mean    22.816
Name: price, dtype: float64

Adding aggregations

In [3]:
df['price'].agg(['mean','std'])

mean    22.816000
std      7.537039
Name: price, dtype: float64

Defining the columns and aggregations in a dictionary

In [4]:
agg_cols = {'quantity': 'sum',
            'price': ['mean', 'std'],
            'invoice': 'count',
            'extended amount': 'sum'}

In [5]:
df.agg(agg_cols)

,quantity,price,invoice,extended amount
sum,22421.0,,,510270.0
mean,,22.816,,
std,,7.537039,,
count,,,1000.0,
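
The blank cells above are NaN: each requested function becomes a row, and a cell is filled only where that function was asked for on that column. A minimal sketch on a hypothetical four-row frame (the `toy` data is made up for illustration; only the column names mirror the sales sheet):

```python
import pandas as pd

# Hypothetical toy data, not taken from sample_sales.xlsx
toy = pd.DataFrame({
    'invoice': ['A-1', 'A-2', 'A-3', 'A-4'],
    'quantity': [2, 4, 6, 8],
    'price': [10, 20, 30, 40],
})

# Same dictionary shape as agg_cols: column name -> function or list of functions
result = toy.agg({'quantity': 'sum',
                  'price': ['mean', 'std'],
                  'invoice': 'count'})

# One row per function; cells for combinations that were not requested are NaN
print(result)
```

The order of the rows may differ between pandas versions, so it is safer to look results up by label (for example `result.loc['sum', 'quantity']`) than by position.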


Aggregation across all columns. Some combinations are meaningless (e.g. the mean of product names), so those cells come back as NaN.

In [6]:
df.agg(['mean','max'])

,invoice,company,purchase_date,product,quantity,price,extended amount
max,ZY-479-41,Zoozzy,2019-12-30 00:00:00.000,shirt,50.0,35.0,1715.0
mean,,,2019-07-04 17:41:16.800,,22.421,22.816,510.27


Replace the NaNs with empty strings so the table is easier to read

In [7]:
df.agg(['mean','max']).fillna("")

,invoice,company,purchase_date,product,quantity,price,extended amount
max,ZY-479-41,Zoozzy,2019-12-30 00:00:00.000,shirt,50.0,35.0,1715.0
mean,,,2019-07-04 17:41:16.800,,22.421,22.816,510.27
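
An alternative to blanking out the NaNs afterwards is to aggregate only the numeric columns in the first place. The sketch below assumes a small hypothetical frame (`toy`) and uses `select_dtypes` to keep just the numeric columns before calling `agg`:

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    'product': ['pen', 'book', 'pen'],
    'quantity': [1, 2, 3],
    'price': [10.0, 20.0, 30.0],
})

# Keep only numeric columns, then aggregate; 'product' never enters the result
numeric_summary = toy.select_dtypes(include='number').agg(['mean', 'max'])
print(numeric_summary)
```

This sidesteps the question of what the mean of a text column should be, at the cost of also dropping the `max` of the text and date columns.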


## Grouping

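Group by product and sum all values. In the pandas version used here, columns that cannot be summed sensibly (e.g. company name) are dropped automatically. pandas 2.0 and later no longer drop them by default, so pass `numeric_only=True` to `sum()` there.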

In [8]:
df.groupby(['product']).sum()

product,quantity,price,extended amount
book,5340,5115,118356
pen,5005,5271,115017
poster,5827,6258,139008
shirt,6249,6172,137889
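
To get this numeric-only behaviour explicitly rather than relying on the pandas version, `sum` accepts a `numeric_only` argument. A minimal sketch on hypothetical data (`toy` is made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: one text column that should not be summed
toy = pd.DataFrame({
    'company': ['Abatz', 'Agivu', 'Abatz', 'Zooxo'],
    'product': ['pen', 'book', 'pen', 'book'],
    'quantity': [1, 2, 3, 4],
    'price': [10, 20, 30, 40],
})

# numeric_only=True restricts the sum to numeric columns, so 'company' is left out
totals = toy.groupby('product').sum(numeric_only=True)
print(totals)
```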


Group by product, but only sum over quantity

In [9]:
df.groupby(['product'])['quantity'].sum()

product
book      5340
pen       5005
poster    5827
shirt     6249
Name: quantity, dtype: int64

The same result can be obtained with a dictionary, as in the aggregation section. The dictionary maps each column we want to aggregate to the function to apply to it: here, the 'sum' of 'quantity'.

In [10]:
agg_cols = {'quantity' : 'sum'}

In [11]:
df.groupby(['product']).agg(agg_cols)

product,quantity
book,5340
pen,5005
poster,5827
shirt,6249
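
The two forms give the same numbers but different containers: selecting the column first returns a Series, while the dictionary form returns a one-column DataFrame. A small sketch on hypothetical data (`toy`) to make the difference visible:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'product': ['pen', 'book', 'pen', 'book'],
    'quantity': [1, 2, 3, 4],
})

# Column selected before aggregating: the result is a Series
as_series = toy.groupby('product')['quantity'].sum()

# Dictionary passed to agg: the result is a DataFrame with a 'quantity' column
as_frame = toy.groupby('product').agg({'quantity': 'sum'})

print(type(as_series).__name__, type(as_frame).__name__)
```

The dictionary form tends to be the more convenient one when further columns or functions are likely to be added later.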


With the total quantity sold per product in hand, add 'company' to the grouping keys to get the same sum for each combination of company and product.

In [12]:
df.groupby(['company','product']).agg(agg_cols)

company,product,quantity
Abatz,book,64
Abatz,pen,7
Abatz,poster,39
Agivu,book,11
Agivu,shirt,20
...,...,...
Zooxo,book,30
Zooxo,shirt,85
Zoozzy,pen,31
Zoozzy,poster,31
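
Grouping by two keys produces a two-level index (a MultiIndex) of company and product. A sketch of how individual entries can be looked up, using hypothetical data (`toy`) with made-up quantities:

```python
import pandas as pd

# Hypothetical toy data; the quantities are invented for illustration
toy = pd.DataFrame({
    'company': ['Abatz', 'Abatz', 'Agivu', 'Abatz'],
    'product': ['book', 'pen', 'book', 'book'],
    'quantity': [10, 7, 11, 5],
})

by_pair = toy.groupby(['company', 'product']).agg({'quantity': 'sum'})

# A (company, product) tuple selects a single cell
abatz_books = by_pair.loc[('Abatz', 'book'), 'quantity']

# A single company selects all of its products, indexed by product
abatz_all = by_pair.loc['Abatz']

print(abatz_books)
print(abatz_all)
```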


Reset the index to turn the group keys back into ordinary columns, which is convenient for filtering, merging, or exporting the result

In [13]:
df.groupby(['company','product']).agg(agg_cols).reset_index()

,company,product,quantity
0,Abatz,book,64
1,Abatz,pen,7
2,Abatz,poster,39
3,Agivu,book,11
4,Agivu,shirt,20
...,...,...,...
726,Zooxo,book,30
727,Zooxo,shirt,85
728,Zoozzy,pen,31
729,Zoozzy,poster,31
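
Once the keys are plain columns, ordinary boolean filtering works on them. The sketch below uses hypothetical data (`toy`) and also shows `as_index=False`, a `groupby` option that should give the same flat layout without a separate `reset_index` call:

```python
import pandas as pd

# Hypothetical toy data; the quantities are invented for illustration
toy = pd.DataFrame({
    'company': ['Abatz', 'Abatz', 'Agivu', 'Abatz'],
    'product': ['book', 'pen', 'book', 'book'],
    'quantity': [10, 7, 11, 5],
})

# Group, aggregate, then move the keys from the index into columns
flat = toy.groupby(['company', 'product']).agg({'quantity': 'sum'}).reset_index()

# Alternative: ask groupby not to use the keys as the index at all
flat_alt = toy.groupby(['company', 'product'], as_index=False).agg({'quantity': 'sum'})

# With flat columns, filtering on the aggregated value is a one-liner
big = flat[flat['quantity'] > 10]
print(big)
```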


And finally, here's how to set up named aggregations, which let you choose the names of the output columns. In this case, for each company we want to know *how many* invoices there are ('count') and, among those invoices, what the *maximum purchase* was ('max' of 'extended amount').

In [16]:
df.groupby(['company']).agg(invoice_total=('invoice','count'),max_purchase=('extended amount','max'))

company,invoice_total,max_purchase
Abatz,5,1410
Agivu,2,700
Aibox,2,828
Ailane,3,400
Aimbo,3,570
...,...,...
Zoonoodle,3,644
Zooveo,4,609
Zoovu,2,165
Zooxo,3,968
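
Each keyword in a named aggregation is the output column name, and its value is a `(column, function)` pair. pandas also provides `pd.NamedAgg`, which spells out the same pair with field names. A sketch on hypothetical data (`toy`, with invented amounts):

```python
import pandas as pd

# Hypothetical toy data; the amounts are invented for illustration
toy = pd.DataFrame({
    'company': ['Abatz', 'Abatz', 'Agivu'],
    'invoice': ['A-1', 'A-2', 'B-1'],
    'extended amount': [100, 250, 70],
})

# pd.NamedAgg(column=..., aggfunc=...) is the explicit form of the (column, function) tuple
summary = toy.groupby('company').agg(
    invoice_total=pd.NamedAgg(column='invoice', aggfunc='count'),
    max_purchase=pd.NamedAgg(column='extended amount', aggfunc='max'),
)
print(summary)
```

Because the source column is passed as a string, column names containing spaces such as 'extended amount' work without any renaming.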
