
# Pandas Grouping
References:
- Blog: https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

- Data: https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2015/06/phone_data.csv

- Data Elements
    - date: The date and time of the entry
    - duration: The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry.
    - item: A description of the event occurring – can be one of call, sms, or data.
    - month: The billing month that each entry belongs to – of form ‘YYYY-MM’.
    - network: The mobile network that was called/texted for each entry.
    - network_type: Whether the number being called was a mobile, international (‘world’), voicemail, landline, or other (‘special’) number.

In [0]:
import pandas as pd

In [2]:
DATA_URL = "https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2015/06/phone_data.csv"

df = pd.read_csv(DATA_URL)
df.head()

Unnamed: 0,index,date,duration,item,month,network,network_type
0,0,15/10/14 06:58,34.429,data,2014-11,data,data
1,1,15/10/14 06:58,13.0,call,2014-11,Vodafone,mobile
2,2,15/10/14 14:46,23.0,call,2014-11,Meteor,mobile
3,3,15/10/14 14:48,4.0,call,2014-11,Tesco,mobile
4,4,15/10/14 17:27,4.0,call,2014-11,Tesco,mobile


In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 7 columns):
index           830 non-null int64
date            830 non-null object
duration        830 non-null float64
item            830 non-null object
month           830 non-null object
network         830 non-null object
network_type    830 non-null object
dtypes: float64(1), int64(1), object(5)
memory usage: 45.5+ KB
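
The `df.info()` output above shows that `date` is read as `object` (plain strings). A minimal sketch of converting it to a datetime type follows. The two sample values are copied from the `df.head()` output, and the day/month/year format string is an assumption inferred from those values and the `month` column, not something the data source documents.

```python
import pandas as pd

# Two sample values copied from the df.head() output above.
sample = pd.DataFrame({"date": ["15/10/14 06:58", "15/10/14 14:46"]})

# Assumption: day/month/two-digit year followed by a 24-hour time.
sample["date"] = pd.to_datetime(sample["date"], format="%d/%m/%y %H:%M")

print(sample.dtypes)
```

Passing an explicit `format` avoids pandas guessing between day-first and month-first for ambiguous values such as `10/11/14`.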


In [3]:
# The column named "index" is not useful, drop it

df.drop(columns=["index"], inplace=True)
df.sample()

Unnamed: 0,date,duration,item,month,network,network_type
195,10/11/14 14:59,13.0,call,2014-11,Three,mobile


In [6]:
# Count how often each value of the categorical variable "item" occurs.
# The result shows 388 phone call entries, 292 text message entries,
# and 150 mobile data entries.

df["item"].value_counts()

call    388
sms     292
data    150
Name: item, dtype: int64
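
To read the same counts as proportions, `value_counts` accepts `normalize=True`. The sketch below rebuilds an `item` column from the counts printed above instead of downloading the CSV, so the Series itself is a stand-in for `df["item"]`.

```python
import pandas as pd

# Stand-in for df["item"], rebuilt from the counts printed above.
items = pd.Series(["call"] * 388 + ["sms"] * 292 + ["data"] * 150, name="item")

# normalize=True returns each category's share of all rows instead of a count.
shares = items.value_counts(normalize=True)

print(shares.round(3))
```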

In [0]:
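# Find the average usage for each mode ("item").
# This returns a DataFrame (not a Series): "item" becomes the row index
# and "duration" is the only numeric column left to average.
# numeric_only=True skips the text columns; pandas 2.0 and later raise
# a TypeError without it.

df.groupby("item").mean(numeric_only=True)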

item,duration
call,237.940722
data,34.429
sms,1.0


In [0]:
# This also returns a DataFrame, but as_index=False keeps "item" as a
# regular column instead of the row index, so the result has two columns:
# "item" and "duration".

df.groupby("item", as_index=False).mean(numeric_only=True)

Unnamed: 0,item,duration
0,call,237.940722
1,data,34.429
2,sms,1.0


In [8]:
df.groupby("item")["duration"].mean()

item
call    237.940722
data     34.429000
sms       1.000000
Name: duration, dtype: float64
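
The three `groupby` calls above differ only in the shape of what they return. The sketch below checks this on a small made-up frame (`toy` and its values are illustrative, not taken from the phone data).

```python
import pandas as pd

# Made-up toy data with the same two columns used above.
toy = pd.DataFrame({
    "item": ["call", "call", "sms", "data"],
    "duration": [10.0, 30.0, 1.0, 34.429],
})

# DataFrame with "item" as the row index.
as_frame = toy.groupby("item").mean()

# DataFrame with "item" kept as an ordinary column.
flat = toy.groupby("item", as_index=False).mean()

# Series: selecting a single column before aggregating drops to one dimension.
as_series = toy.groupby("item")["duration"].mean()

print(type(as_frame).__name__, type(flat).__name__, type(as_series).__name__)
```

Selecting the column first (`["duration"]`) is usually the cheapest form, since only that column is aggregated.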

In [14]:
# Find the maximum, minimum, and total time spent on phone calls each month.

df[df['item'] == 'call'].groupby('month').agg(
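    # String names select pandas' own aggregations; recent pandas versions
    # warn when the Python builtins max/min/sum are passed instead.
    max_duration=pd.NamedAgg(column='duration', aggfunc='max'),
    min_duration=pd.NamedAgg(column='duration', aggfunc='min'),
    total_duration=pd.NamedAgg(column='duration', aggfunc='sum')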
)

month,max_duration,min_duration,total_duration
2014-11,1940.0,1.0,25547.0
2014-12,2120.0,2.0,13561.0
2015-01,1859.0,2.0,17070.0
2015-02,1863.0,1.0,14416.0
2015-03,10528.0,2.0,21727.0


In [0]:
df.pivot_table(index="item",columns="month", values="duration", aggfunc="mean")

month,2014-11,2014-12,2015-01,2015-02,2015-03
item,,,,,
call,238.757009,171.658228,193.977273,215.164179,462.276596
data,34.429,34.429,34.429,34.429,34.429
sms,1.0,1.0,1.0,1.0,1.0
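
The pivot table above can be read as a two-key `groupby` whose second key is moved into the columns. The sketch below compares the two forms on a small made-up frame (`toy` and its values are illustrative, not from the phone data).

```python
import pandas as pd

# Made-up toy data covering every item/month combination.
toy = pd.DataFrame({
    "item": ["call", "call", "call", "sms", "sms"],
    "month": ["2014-11", "2014-11", "2014-12", "2014-11", "2014-12"],
    "duration": [10.0, 30.0, 5.0, 1.0, 1.0],
})

# Mean duration per item (rows) and month (columns).
pivoted = toy.pivot_table(index="item", columns="month",
                          values="duration", aggfunc="mean")

# Same numbers: group by both keys, then move "month" into the columns.
unstacked = toy.groupby(["item", "month"])["duration"].mean().unstack("month")

print(pivoted)
print(unstacked)
```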
