# Summarising, Aggregating, and Grouping data in Python Pandas

In [1]:
import os, sys, time
from time import sleep
from pathlib import Path

In [2]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import dateutil.parser  # import the submodule explicitly; `import dateutil` alone does not guarantee it is loaded

In [3]:
# Load data from csv file
data = pd.read_csv("C:/data/inbound/2020-03-20/phone_data.csv")
# Convert date from string to date time
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True) 
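As a minimal sketch of the same conversion, `pd.to_datetime` can parse the whole column in one vectorised call instead of applying the parser row by row. The sample strings below are hypothetical; the actual date format in `phone_data.csv` is not shown here.

```python
import pandas as pd

# Hypothetical day-first date strings (not taken from the real file)
raw = pd.DataFrame({"date": ["15/10/2014 06:58", "01/11/2014 10:00"]})

# dayfirst=True makes "01/11/2014" mean 1 November rather than 11 January
raw["date"] = pd.to_datetime(raw["date"], dayfirst=True)
```

The second value is the interesting one: without `dayfirst=True` it could be read month-first.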

### Summarising the DataFrame

#### Once the data has been loaded into Python, pandas makes calculating summary statistics simple. The mean, max, min, standard deviation, and more are available as methods on each column:

In [4]:
# How many rows are in the dataset? (count() counts non-null values)
data['item'].count()

830

In [6]:
# What was the longest phone call / data entry?
data['duration'].max()

10528.0

In [7]:
# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()

92321.0

In [8]:
# How many entries are there for each month?
data['month'].value_counts()

2014-11    230
2015-01    205
2014-12    157
2015-02    137
2015-03    101
Name: month, dtype: int64

In [9]:
# Number of non-null unique network entries
data['network'].nunique()

9

| Function | Description |
| -------- | ----------- |
| `count`  | Number of non-null observations |
| `sum`    | Sum of values |
| `mean`   | Mean of values |
| `mad`    | Mean absolute deviation (removed in pandas 2.0) |
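A small sketch of these summary functions on a made-up Series may help show how missing values are treated. The numbers are illustrative only. The mean absolute deviation is computed by hand, because `Series.mad()` is not available in pandas 2.x.

```python
import numpy as np
import pandas as pd

# Made-up durations; NaN stands in for a missing observation
durations = pd.Series([10.0, 20.0, np.nan, 30.0])

n_obs = durations.count()    # counts non-null values only
total = durations.sum()      # NaN is skipped by default
average = durations.mean()   # mean of the three non-null values

# Mean absolute deviation, written out explicitly
mad = (durations - durations.mean()).abs().mean()
```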

## Summarising Groups in the DataFrame

#### The `groupby()` function returns a GroupBy object, which essentially describes how the rows of the original data set have been split.

#### The GroupBy object's `.groups` attribute is a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group.

For example:

In [10]:
data.groupby(['month']).groups.keys()

dict_keys(['2014-11', '2014-12', '2015-01', '2015-02', '2015-03'])

In [11]:
len(data.groupby(['month']).groups['2014-11'])

230
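To make the structure of `.groups` concrete, here is a minimal sketch on a three-row toy frame (the values are invented, not taken from the phone data). It also uses `get_group()`, which returns the rows of one group as a DataFrame.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12"],
    "duration": [34.4, 13.0, 23.0],
})

grouped = toy.groupby("month")

# .groups maps each group key to the row labels that belong to it
group_labels = {key: list(labels) for key, labels in grouped.groups.items()}

# get_group() pulls out the rows for a single key
nov = grouped.get_group("2014-11")
```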

In [12]:
# Get the first entry for each month
data.groupby('month').first()

| month | index | date | duration | item | network | network_type |
| --- | --- | --- | --- | --- | --- | --- |
| 2014-11 | 0 | 2014-10-15 06:58:00 | 34.429 | data | data | data |
| 2014-12 | 228 | 2014-11-13 06:58:00 | 34.429 | data | data | data |
| 2015-01 | 381 | 2014-12-13 06:58:00 | 34.429 | data | data | data |
| 2015-02 | 577 | 2015-01-13 06:58:00 | 34.429 | data | data | data |
| 2015-03 | 729 | 2015-02-12 20:15:00 | 69.0 | call | landline | landline |


In [13]:
# Get the sum of the durations per month
data.groupby('month')['duration'].sum()

month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

In [14]:
# Get the number of dates / entries in each month
data.groupby('month')['date'].count()

month
2014-11    230
2014-12    157
2015-01    205
2015-02    137
2015-03    101
Name: date, dtype: int64

In [15]:
# What is the sum of durations, for calls only, to each network
data[data['item'] == 'call'].groupby('network')['duration'].sum()

network
Meteor        7200.0
Tesco        13828.0
Three        36464.0
Vodafone     14621.0
landline     18433.0
voicemail     1775.0
Name: duration, dtype: float64

#### You can also group by more than one variable, which allows more complex queries.

In [16]:
# How many calls, sms, and data entries are in each month?
data.groupby(['month', 'item'])['date'].count()

month    item
2014-11  call    107
         data     29
         sms      94
2014-12  call     79
         data     30
         sms      48
2015-01  call     88
         data     31
         sms      86
2015-02  call     67
         data     31
         sms      39
2015-03  call     47
         data     29
         sms      25
Name: date, dtype: int64

In [17]:
# How many calls, texts, and data entries are there per month, split by network_type?
data.groupby(['month', 'network_type'])['date'].count()

month    network_type
2014-11  data             29
         landline          5
         mobile          189
         special           1
         voicemail         6
2014-12  data             30
         landline          7
         mobile          108
         voicemail         8
         world             4
2015-01  data             31
         landline         11
         mobile          160
         voicemail         3
2015-02  data             31
         landline          8
         mobile           90
         special           2
         voicemail         6
2015-03  data             29
         landline         11
         mobile           54
         voicemail         4
         world             3
Name: date, dtype: int64
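Grouping by two columns yields a Series with a two-level index. A short sketch, on invented rows, of how individual results can be read back out with `.loc`:

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-11", "2014-12"],
    "item": ["call", "sms", "call", "call"],
    "date": pd.to_datetime(
        ["2014-10-15", "2014-10-16", "2014-10-17", "2014-11-14"]
    ),
})

counts = toy.groupby(["month", "item"])["date"].count()

# A full (month, item) tuple selects a single value
nov_calls = counts.loc[("2014-11", "call")]

# The outer key alone selects every item within that month
nov_only = counts.loc["2014-11"]
```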

## Groupby output format - Series or DataFrame?

#### The output from a groupby and aggregation operation can be either a pandas Series or a pandas DataFrame, which can be confusing for new users.

#### As a rule of thumb, if you calculate more than one column of results, your result will be a DataFrame. For a single column of results, the result is a Series by default; selecting the column with a list (double brackets) returns a DataFrame instead.

In [19]:
# produces Pandas Series
data.groupby('month')['duration'].sum()

month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

In [20]:
# Produces Pandas DataFrame
data.groupby('month')[['duration']].sum()

| month | duration |
| --- | --- |
| 2014-11 | 26639.441 |
| 2014-12 | 14641.87 |
| 2015-01 | 18223.299 |
| 2015-02 | 15522.299 |
| 2015-03 | 22750.441 |
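The difference between the two selections can be checked directly with `isinstance`. This is a minimal sketch on toy values, not the phone data:

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12"],
    "duration": [10.0, 5.0, 7.0],
})

# Single brackets select one column, so the result is a Series
as_series = toy.groupby("month")["duration"].sum()

# Double brackets select a list of columns, so the result is a DataFrame
as_frame = toy.groupby("month")[["duration"]].sum()

series_type = type(as_series).__name__
frame_type = type(as_frame).__name__
```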


#### The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass `as_index=False` to the groupby operation.

In [21]:
data.groupby('month', as_index=False).agg({"duration": "sum"})

|   | month | duration |
| --- | --- | --- |
| 0 | 2014-11 | 26639.441 |
| 1 | 2014-12 | 14641.87 |
| 2 | 2015-01 | 18223.299 |
| 3 | 2015-02 | 15522.299 |
| 4 | 2015-03 | 22750.441 |
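An alternative that should give the same layout is to group normally and then call `reset_index()` on the result. A small sketch on toy values comparing the two approaches:

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12"],
    "duration": [10.0, 5.0, 7.0],
})

# Grouping key kept as an ordinary column
flat = toy.groupby("month", as_index=False).agg({"duration": "sum"})

# Grouping key first becomes the index, then is moved back into a column
same = toy.groupby("month").agg({"duration": "sum"}).reset_index()
```

Which form to use is mostly a matter of taste; `reset_index()` is handy when the grouped result already exists with its index set.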


### Multiple Statistics per Group

#### The final piece of syntax that we’ll examine is the `agg()` function for Pandas. The aggregation functionality provided by `agg()` allows multiple statistics to be calculated per group in one calculation.

### Applying a single function to columns in groups

#### Instructions for aggregation are provided in the form of a Python dictionary or list. The dictionary keys specify the columns on which you’d like to perform operations, and the dictionary values specify the function to run.

In [22]:
# Group the data frame by month and item and extract a number of stats from each group
data.groupby(['month', 'item']).agg({
    'duration': 'sum',        # sum duration per group
    'network_type': 'count',  # count non-null network_type entries
    'date': 'first'           # get the first date per group
})

| month | item | duration | network_type | date |
| --- | --- | --- | --- | --- |
| 2014-11 | call | 25547.0 | 107 | 2014-10-15 06:58:00 |
| 2014-11 | data | 998.441 | 29 | 2014-10-15 06:58:00 |
| 2014-11 | sms | 94.0 | 94 | 2014-10-16 22:18:00 |
| 2014-12 | call | 13561.0 | 79 | 2014-11-14 17:24:00 |
| 2014-12 | data | 1032.87 | 30 | 2014-11-13 06:58:00 |
| 2014-12 | sms | 48.0 | 48 | 2014-11-14 17:28:00 |
| 2015-01 | call | 17070.0 | 88 | 2014-12-15 20:03:00 |
| 2015-01 | data | 1067.299 | 31 | 2014-12-13 06:58:00 |
| 2015-01 | sms | 86.0 | 86 | 2014-12-15 19:56:00 |
| 2015-02 | call | 14416.0 | 67 | 2015-01-15 10:36:00 |


#### The aggregation dictionary syntax is flexible and can be defined before the operation. You can also define functions inline using `lambda` functions to extract statistics that are not provided by the built-in options.

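In [26]:
# Define the aggregation procedure outside of the groupby operation
aggregations = {
    'duration': 'sum',
    # max(x) - 1 fails with "ValueError: Cannot add integral value to
    # Timestamp without freq.", so subtract an explicit Timedelta instead.
    # One day is an assumed unit; the original offset did not state one.
    'date': lambda x: x.max() - pd.Timedelta(days=1)
}

In [27]:
data.groupby('month').agg(aggregations)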
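As a sketch of a statistic with no built-in name, the lambda below computes the share of entries in each group that last longer than 60 seconds. The threshold and the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-11", "2014-12"],
    "item": ["call", "call", "call", "call"],
    "duration": [30.0, 90.0, 120.0, 10.0],
})

aggregations = {
    # Fraction of entries longer than 60 seconds (hypothetical threshold)
    "duration": lambda x: (x > 60).mean(),
    # Built-in aggregation referenced by name
    "item": "count",
}

result = toy.groupby("month").agg(aggregations)
```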

### Applying Multiple Functions to Columns in Groups

#### To apply multiple functions to a single column in your grouped data, expand the syntax above to pass in a list of functions as the value in your aggregation dictionary.

See below:

In [28]:
# Group the data frame by month and item and extract a number of stats from each group
data.groupby(['month', 'item']).agg({
    # Find the min, max, and sum of the duration column
    'duration': ['min', 'max', 'sum'],
    # Find the number of network type entries
    'network_type': 'count',
    # Minimum, first, and number of unique dates
    'date': ['min', 'first', 'nunique']
})

| month | item | duration (min) | duration (max) | duration (sum) | network_type (count) | date (min) | date (first) | date (nunique) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2014-11 | call | 1.0 | 1940.0 | 25547.0 | 107 | 2014-10-15 06:58:00 | 2014-10-15 06:58:00 | 104 |
| 2014-11 | data | 34.429 | 34.429 | 998.441 | 29 | 2014-10-15 06:58:00 | 2014-10-15 06:58:00 | 29 |
| 2014-11 | sms | 1.0 | 1.0 | 94.0 | 94 | 2014-10-16 22:18:00 | 2014-10-16 22:18:00 | 79 |
| 2014-12 | call | 2.0 | 2120.0 | 13561.0 | 79 | 2014-11-14 17:24:00 | 2014-11-14 17:24:00 | 76 |
| 2014-12 | data | 34.429 | 34.429 | 1032.87 | 30 | 2014-11-13 06:58:00 | 2014-11-13 06:58:00 | 30 |
| 2014-12 | sms | 1.0 | 1.0 | 48.0 | 48 | 2014-11-14 17:28:00 | 2014-11-14 17:28:00 | 41 |
| 2015-01 | call | 2.0 | 1859.0 | 17070.0 | 88 | 2014-12-15 20:03:00 | 2014-12-15 20:03:00 | 84 |
| 2015-01 | data | 34.429 | 34.429 | 1067.299 | 31 | 2014-12-13 06:58:00 | 2014-12-13 06:58:00 | 31 |
| 2015-01 | sms | 1.0 | 1.0 | 86.0 | 86 | 2014-12-15 19:56:00 | 2014-12-15 19:56:00 | 58 |
| 2015-02 | call | 1.0 | 1863.0 | 14416.0 | 67 | 2015-01-15 10:36:00 | 2015-01-15 10:36:00 | 67 |


#### The `agg()` syntax is flexible and simple to use. Remember that you can pass custom and lambda functions in your list of aggregated calculations, and each will be passed the values from the column in your grouped data.
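A minimal sketch of a custom function inside the list of aggregations. The helper `spread` is hypothetical and the toy values are invented. A named function contributes its own name as the column label.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12"],
    "duration": [10.0, 5.0, 7.0],
})


def spread(x):
    """Hypothetical helper: difference between the largest and smallest value."""
    return x.max() - x.min()


# Built-in names and a custom function can be mixed in one list
stats = toy.groupby("month").agg({"duration": ["min", "max", spread]})

column_labels = list(stats.columns)
```

Each function receives the `duration` values of one group as a Series, which is why Series methods such as `.max()` are available inside `spread`.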

### Renaming grouped aggregation columns

#### We’ll examine two methods to group DataFrames and rename the resulting columns.


### Recommended: Tuple Named Aggregations

In [29]:
data[data['item'] == 'call'].groupby('month').agg(
    # Get max of the duration column for each group
    max_duration=('duration', 'max'),
    # Get min of the duration column for each group
    min_duration=('duration', 'min'),
    # Get sum of the duration column for each group
    total_duration=('duration', 'sum'),
    # Apply a lambda to the date column: days between first and last entry
    num_days=('date', lambda x: (max(x) - min(x)).days)
)

| month | max_duration | min_duration | total_duration | num_days |
| --- | --- | --- | --- | --- |
| 2014-11 | 1940.0 | 1.0 | 25547.0 | 28 |
| 2014-12 | 2120.0 | 2.0 | 13561.0 | 30 |
| 2015-01 | 1859.0 | 2.0 | 17070.0 | 30 |
| 2015-02 | 1863.0 | 1.0 | 14416.0 | 25 |
| 2015-03 | 10528.0 | 2.0 | 21727.0 | 19 |


#### The example above groups with named aggregation, a syntax introduced in Pandas 0.25. Each keyword names an output column, and its tuple specifies the column to work on and the function to apply to each group.


#### For clearer naming, Pandas also provides the `pd.NamedAgg` namedtuple, which can be used to achieve the same result as plain tuples:

In [30]:
data[data['item'] == 'call'].groupby('month').agg(
    max_duration=pd.NamedAgg(column='duration', aggfunc='max'),
    min_duration=pd.NamedAgg(column='duration', aggfunc='min'),
    total_duration=pd.NamedAgg(column='duration', aggfunc='sum'),
    num_days=pd.NamedAgg(
        column='date',
        aggfunc=lambda x: (max(x) - min(x)).days
    )
)

| month | max_duration | min_duration | total_duration | num_days |
| --- | --- | --- | --- | --- |
| 2014-11 | 1940.0 | 1.0 | 25547.0 | 28 |
| 2014-12 | 2120.0 | 2.0 | 13561.0 | 30 |
| 2015-01 | 1859.0 | 2.0 | 17070.0 | 30 |
| 2015-02 | 1863.0 | 1.0 | 14416.0 | 25 |
| 2015-03 | 10528.0 | 2.0 | 21727.0 | 19 |


Note that in some versions of Pandas following the 0.25 release, lambda functions only work in these named aggregations when the lambda is the only function applied to a given column; otherwise a KeyError is raised.
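One way to sidestep that restriction, assuming it applies to your Pandas version, is to pass ordinary named functions instead of lambdas. The helpers `first_quartile` and `third_quartile` below are hypothetical, and the toy values are invented.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "month": ["2014-11"] * 4 + ["2014-12"],
    "duration": [10.0, 20.0, 30.0, 40.0, 5.0],
})


def first_quartile(x):
    """Hypothetical helper: 25th percentile of the group."""
    return x.quantile(0.25)


def third_quartile(x):
    """Hypothetical helper: 75th percentile of the group."""
    return x.quantile(0.75)


# Two custom statistics on the same column, each with its own output name
summary = toy.groupby("month").agg(
    q1_duration=("duration", first_quartile),
    q3_duration=("duration", third_quartile),
    total_duration=("duration", "sum"),
)
```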

### Renaming index using droplevel and ravel

#### When multiple statistics are calculated on columns, the resulting DataFrame will have a multi-index set on the column axis. The multi-index can be difficult to work with, and I typically rename columns after a groupby operation.

#### One option is to drop the top level (using .droplevel) of the newly created multi-index on columns using:

In [33]:
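# The aggregation spec must be a dictionary; function names are given as
# strings because the bare name `mean` is not defined in Python
grouped = data.groupby('month').agg({"duration": ["min", "max", "mean"]})
# Drop the top column level ("duration"), leaving min / max / mean
grouped.columns = grouped.columns.droplevel(level=0)
# rename() returns a new DataFrame, so assign the result back
grouped = grouped.rename(columns={"min": "min_duration", "max": "max_duration", "mean": "mean_duration"})
grouped.head()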
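The heading above also mentions flattening the column multi-index rather than dropping a level. A minimal sketch of that idea, assuming the intent is to join both levels into a single label: it iterates over the column tuples directly instead of calling `ravel()`, and the toy values are invented.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12"],
    "duration": [10.0, 5.0, 7.0],
})

grouped = toy.groupby("month").agg({"duration": ["min", "max", "mean"]})

# Each column label is a tuple such as ("duration", "min");
# joining the parts gives flat names like "duration_min"
grouped.columns = ["_".join(col) for col in grouped.columns]

flat_names = list(grouped.columns)
```

Unlike `droplevel`, this keeps the source column in the name, which avoids clashes when several columns share the same statistic.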