# Pandas Library Practice

- The Pandas DataFrame – loading, editing, and viewing data in Python.
- Summarising, Aggregating, and Grouping data in Python Pandas

Since May 2018

Tianyu

---

## What is a Python Pandas DataFrame?
The Pandas library documentation defines a DataFrame as a “two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)”. In plain terms, think of a DataFrame as a table of data, i.e. a single set of formatted two-dimensional data, with the following characteristics:

- There can be multiple rows and columns in the data.
- Each row represents a sample of data.
- Each column contains a different variable that describes the samples (rows).
- The data in every column is usually the same type of data, e.g. numbers, strings, dates.
- Unlike a typical Excel data set, a well-prepared DataFrame usually has no gaps or empty rows and columns, although individual missing values can still be stored (as NaN) where data is unavailable.


---

## Creating Pandas DataFrames
We’ll examine two methods to create a DataFrame – manually, and from comma-separated value (CSV) files.

### Manually entering data
The start of every data science project will include getting useful data into an analysis environment, in this case Python. There are multiple ways to create DataFrames in Python; the simplest is to type the data in manually, which only works for tiny datasets.

Skip for Now
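
For reference, a minimal sketch of manual entry is shown here. The values are made up for illustration (they only loosely resemble the FAO columns used later), and a dictionary of lists is just one of several ways to build a DataFrame by hand:

```python
import pandas as pd

# Hypothetical sample values, loosely modelled on the FAO columns used later
manual_df = pd.DataFrame({
    "Area": ["Afghanistan", "Ireland", "Ireland"],
    "Item": ["Wheat and products", "Wheat and products", "Barley and products"],
    "Y2013": [4895, 500, 120],
})

# Each dictionary key becomes a column; each list position becomes a row
print(manual_df.shape)
```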

### Loading CSV data into Pandas

Creating DataFrames from CSV (comma-separated value) files is simple with the read_csv() function in Pandas, once you know the path to your file. A CSV file is a text file containing data in table form, where columns are separated using the ‘,’ comma character, and rows are on separate lines.
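
To make the CSV layout concrete, here is a small self-contained sketch. The CSV text is hypothetical, and io.StringIO is used only as a stand-in for a file on disk so that the example runs without any data files:

```python
import io
import pandas as pd

# Hypothetical CSV content: a header line, then one line per row
csv_text = (
    "Area,Item,Y2013\n"
    "Afghanistan,Wheat and products,4895\n"
    "Ireland,Barley and products,120\n"
)

# read_csv accepts a path or any file-like object
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df)
```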

In [75]:
import pandas as pd
import dateutil

# Load data from csv file
data = pd.read_csv('data/FAO_database.csv', encoding='iso-8859-1')


In [76]:
data.head()

Unnamed: 0,Area Abbreviation,Area Code,Area,Item Code,Item,Element Code,Element,Unit,latitude,longitude,...,Y2004,Y2005,Y2006,Y2007,Y2008,Y2009,Y2010,Y2011,Y2012,Y2013
0,AF,2,Afghanistan,2511,Wheat and products,5142,Food,1000 tonnes,33.94,67.71,...,3249.0,3486.0,3704.0,4164.0,4252.0,4538.0,4605.0,4711.0,4810,4895
1,AF,2,Afghanistan,2805,Rice (Milled Equivalent),5142,Food,1000 tonnes,33.94,67.71,...,419.0,445.0,546.0,455.0,490.0,415.0,442.0,476.0,425,422
2,AF,2,Afghanistan,2513,Barley and products,5521,Feed,1000 tonnes,33.94,67.71,...,58.0,236.0,262.0,263.0,230.0,379.0,315.0,203.0,367,360
3,AF,2,Afghanistan,2513,Barley and products,5142,Food,1000 tonnes,33.94,67.71,...,185.0,43.0,44.0,48.0,62.0,55.0,60.0,72.0,78,89
4,AF,2,Afghanistan,2514,Maize and products,5521,Feed,1000 tonnes,33.94,67.71,...,120.0,208.0,233.0,249.0,247.0,195.0,178.0,191.0,200,200


In this example, we load global food production data from a CSV file downloaded from the data science competition website, Kaggle. The CSV file can be downloaded from Kaggle.

## Preview and examine data in a Pandas DataFrame
Once you have data in Python, you’ll want to check that the data has loaded, and confirm that the expected columns and rows are present.

(A note on the phone-usage data set used in the grouping section later on: phone numbers were removed for privacy, and the date column can be parsed using the handy dateutil library.)

### DataFrame rows and columns with .shape
The shape attribute gives information on the data set size: ‘shape’ returns a tuple with the number of rows and the number of columns in the DataFrame. Another descriptive property is ‘ndim’, which gives the number of dimensions in your data, typically 2. The ‘dtypes’ property lists the data type of each column.

In [77]:
data.shape

(21477, 63)

In [78]:
data.ndim

2

In [79]:
data.dtypes

Area Abbreviation     object
Area Code              int64
Area                  object
Item Code              int64
Item                  object
Element Code           int64
Element               object
Unit                  object
latitude             float64
longitude            float64
Y1961                float64
Y1962                float64
Y1963                float64
Y1964                float64
Y1965                float64
Y1966                float64
Y1967                float64
Y1968                float64
Y1969                float64
Y1970                float64
Y1971                float64
Y1972                float64
Y1973                float64
Y1974                float64
Y1975                float64
Y1976                float64
Y1977                float64
Y1978                float64
Y1979                float64
Y1980                float64
                      ...   
Y1984                float64
Y1985                float64
Y1986                float64
Y1987         

### Describing data with .describe()
Finally, to see some of the core statistics about a particular column, you can use the ‘describe’ function.

- For numeric columns, describe() returns basic statistics: the value count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles for the data in a column.

- For string columns, describe() returns the value count, the number of unique entries, the most frequently occurring value (‘top’), and the number of times the top value occurs (‘freq’).

Call describe() on the whole DataFrame to summarise every numeric column, or select a single column using a string inside the [] braces and call describe() on it, as follows:

In [80]:
data.describe()

Unnamed: 0,Area Code,Item Code,Element Code,latitude,longitude,Y1961,Y1962,Y1963,Y1964,Y1965,...,Y2004,Y2005,Y2006,Y2007,Y2008,Y2009,Y2010,Y2011,Y2012,Y2013
count,21477.0,21477.0,21477.0,21477.0,21477.0,17938.0,17938.0,17938.0,17938.0,17938.0,...,21128.0,21128.0,21373.0,21373.0,21373.0,21373.0,21373.0,21373.0,21477.0,21477.0
mean,125.449411,2694.211529,5211.687154,20.450613,15.794445,195.262069,200.78225,205.4646,209.925577,217.556751,...,486.690742,493.153256,496.319328,508.482104,522.844898,524.581996,535.492069,553.399242,560.569214,575.55748
std,72.868149,148.973406,146.820079,24.628336,66.012104,1864.124336,1884.265591,1861.174739,1862.000116,2014.934333,...,5001.782008,5100.057036,5134.819373,5298.939807,5496.697513,5545.939303,5721.089425,5883.071604,6047.950804,6218.379479
min,1.0,2511.0,5142.0,-40.9,-172.1,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-169.0,-246.0
25%,63.0,2561.0,5142.0,6.43,-11.78,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,120.0,2640.0,5142.0,20.59,19.15,1.0,1.0,1.0,1.0,1.0,...,6.0,6.0,7.0,7.0,7.0,7.0,7.0,8.0,8.0,8.0
75%,188.0,2782.0,5142.0,41.15,46.87,21.0,22.0,23.0,24.0,25.0,...,75.0,77.0,78.0,80.0,82.0,83.0,83.0,86.0,88.0,90.0
max,276.0,2961.0,5521.0,64.96,179.41,112227.0,109130.0,106356.0,104234.0,119378.0,...,360767.0,373694.0,388100.0,402975.0,425537.0,434724.0,451838.0,462696.0,479028.0,489299.0


In [81]:
data['Y2006'].describe()

count     21373.000000
mean        496.319328
std        5134.819373
min           0.000000
25%           0.000000
50%           7.000000
75%          78.000000
max      388100.000000
Name: Y2006, dtype: float64
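
The outputs above only cover numeric columns. As a small sketch of the string-column case, the following uses a made-up Series (the values are hypothetical, not taken from the FAO file):

```python
import pandas as pd

# Hypothetical string data
items = pd.Series(["call", "sms", "call", "data", "call"], name="item")

# For strings, describe() reports count, unique, top and freq
summary = items.describe()
print(summary)
```

Here ‘top’ is the most frequent value and ‘freq’ is how often it occurs.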

## Selecting and Manipulating Data
The data selection methods for Pandas are very flexible. In another post on this site, I’ve written extensively about the core selection methods in Pandas – namely iloc and loc. For detailed information and to master selection, be sure to read that post. For this example, we will look at the basic method for column and row selection.

### Selecting columns
There are three main methods of selecting columns in pandas:

- using a dot notation, e.g. data.column_name,
- using square braces and the name of the column as a string, e.g. data['column_name']
- or using numeric indexing and the iloc selector data.iloc[:, <column_number>]
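
The three methods can be compared side by side on a tiny, made-up DataFrame (the column names mirror the FAO data, the values are hypothetical). Note that dot notation only works when the column name is a valid Python identifier, so a name containing a space needs the square-brace form:

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["Afghanistan", "Ireland"],
    "Item Code": [2511, 2805],
    "Y2010": [4605.0, 442.0],
})

by_dot = df.Y2010            # dot notation
by_name = df["Y2010"]        # square braces with the column name
by_position = df.iloc[:, 2]  # iloc with the column number (third column)

# "Item Code" contains a space, so dot notation cannot be used for it
codes = df["Item Code"]
```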

In [82]:
data.Y2010.head()

0    4605.0
1     442.0
2     315.0
3      60.0
4     178.0
Name: Y2010, dtype: float64

When a column is selected using any of these methods, the result is a pandas.Series, which is a one-dimensional set of data. It’s useful to know the basic operations that can be carried out on a Series, including summing (.sum()), averaging (.mean()), counting (.count()), getting the median (.median()), and replacing missing values (.fillna(new_value)).

In [83]:
data.Y2011.mean()

553.3992420343424
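
As a quick sketch of the Series operations listed above, here is a made-up Series containing one missing value. By default the summary methods skip missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
s = pd.Series([4.0, np.nan, 6.0, 10.0])

total = s.sum()        # missing value is skipped
average = s.mean()     # mean of the three non-missing values
n_values = s.count()   # counts non-missing values only
middle = s.median()
filled = s.fillna(0)   # returns a new Series with NaN replaced by 0
```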

Selecting **multiple columns** at the same time extracts a new DataFrame from your existing DataFrame. For selection of multiple columns, the syntax is:

- square-brace selection with a list of column names, e.g. data[['column_name_1', 'column_name_2']]
- using numeric indexing with the iloc selector and a list of column numbers, e.g. data.iloc[:, [0,1,20,22]]

In [84]:
data['Item Code'].astype(str).head()

0    2511
1    2805
2    2513
3    2513
4    2514
Name: Item Code, dtype: object

In [85]:
data.iloc[:,[0,1,3]].head()

  Area Abbreviation  Area Code  Item Code
0                AF          2       2511
1                AF          2       2805
2                AF          2       2513
3                AF          2       2513
4                AF          2       2514
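
The square-brace form with a list of names is not shown above, so here is a minimal sketch on a made-up DataFrame (hypothetical values). Both forms return a DataFrame rather than a Series:

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["Afghanistan", "Ireland"],
    "Item Code": [2511, 2805],
    "Y2010": [4605.0, 442.0],
})

# Note the double brackets: the inner pair is a Python list of column names
by_names = df[["Area", "Y2010"]]

# The same two columns selected by position
by_positions = df.iloc[:, [0, 2]]
```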


### Selecting rows
Rows in a DataFrame are selected, typically, using the iloc/loc selection methods, or using logical selectors (selecting based on the value of another column or variable).

The basic methods to get your head around are:

- numeric row selection using the iloc selector, e.g. data.iloc[0:10, :] – select the first 10 rows.
- label-based row selection using the loc selector (this is most useful once you have set an “index” on your dataframe; without one, the labels are simply the default row numbers), e.g. data.loc[44, :]
- logical-based row selection using evaluated statements, e.g. data[data["Area"] == "Ireland"] – select the rows where Area value is ‘Ireland’.
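
The three row-selection methods can be sketched on a small, made-up DataFrame (the values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["Afghanistan", "Ireland", "Ireland", "France"],
    "Y2013": [4895, 500, 120, 900],
})

# Position-based: the first two rows
first_two = df.iloc[0:2, :]

# Label-based: the row whose index label is 2 (default integer labels here)
row_two = df.loc[2, :]

# Logical selection: rows where Area equals "Ireland"
irish = df[df["Area"] == "Ireland"]

# After setting an index, loc selects by the new labels
by_area = df.set_index("Area")
ireland_rows = by_area.loc["Ireland", :]
```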

Note that you can combine the selection methods for columns and rows in many ways to achieve the selection of your dreams. For details, please refer to the post [“Using iloc, loc, and ix to select and index data”](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/).

## Deleting rows and columns (drop)
To delete rows and columns from DataFrames, Pandas uses the “drop” function.

To delete a column, or multiple columns, use the name of the column(s), and specify the “axis” as 1. Alternatively, as in the example below, the ‘columns’ parameter has been added in Pandas which cuts out the need for ‘axis’. The drop function returns a new DataFrame, with the columns removed. To actually edit the original DataFrame, the “inplace” parameter can be set to True, and there is no returned value.

In [86]:
data.head()

Unnamed: 0,Area Abbreviation,Area Code,Area,Item Code,Item,Element Code,Element,Unit,latitude,longitude,...,Y2004,Y2005,Y2006,Y2007,Y2008,Y2009,Y2010,Y2011,Y2012,Y2013
0,AF,2,Afghanistan,2511,Wheat and products,5142,Food,1000 tonnes,33.94,67.71,...,3249.0,3486.0,3704.0,4164.0,4252.0,4538.0,4605.0,4711.0,4810,4895
1,AF,2,Afghanistan,2805,Rice (Milled Equivalent),5142,Food,1000 tonnes,33.94,67.71,...,419.0,445.0,546.0,455.0,490.0,415.0,442.0,476.0,425,422
2,AF,2,Afghanistan,2513,Barley and products,5521,Feed,1000 tonnes,33.94,67.71,...,58.0,236.0,262.0,263.0,230.0,379.0,315.0,203.0,367,360
3,AF,2,Afghanistan,2513,Barley and products,5142,Food,1000 tonnes,33.94,67.71,...,185.0,43.0,44.0,48.0,62.0,55.0,60.0,72.0,78,89
4,AF,2,Afghanistan,2514,Maize and products,5521,Feed,1000 tonnes,33.94,67.71,...,120.0,208.0,233.0,249.0,247.0,195.0,178.0,191.0,200,200


Rows can also be removed using the “drop” function, by specifying axis=0. Drop() removes rows based on “labels”, rather than numeric indexing. To delete rows based on their numeric position / index, use iloc to reassign the dataframe values, as in the examples below.

In [88]:
# Delete the rows with labels 0,1,2
data = data.drop([0,1,2], axis=0)
 
# Delete the rows with label "Ireland"
# For label-based deletion, set the index first on the dataframe:
data = data.set_index("Area")
data = data.drop("Ireland", axis=0) # Delete all rows with label "Ireland"
 
# Delete the first five rows using iloc selector
data = data.iloc[5:,]
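
The difference between labels and positions is easiest to see when the index is not in order. The following sketch uses a made-up DataFrame with a shuffled index (hypothetical values):

```python
import pandas as pd

# Index labels deliberately out of order
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[3, 1, 0, 2])

# drop() works on labels: removes the rows labelled 0 and 1
by_label = df.drop([0, 1], axis=0)

# iloc works on positions: removes the first two rows, whatever their labels
by_position = df.iloc[2:]
```

The two results keep different rows: the first keeps labels 3 and 2, the second keeps labels 0 and 2.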

In [90]:
# Deleting columns
# reset index as column
data = data.reset_index(drop=False)

# Delete the "Area" column from the dataframe
# data = data.drop("Area", axis=1)
 
# alternatively, delete columns using the columns parameter of drop
# data = data.drop(columns="Area")
 
# Delete the Area column from the dataframe in place
# Note that the original 'data' object is changed when inplace=True
data.drop("Area", axis=1, inplace=True)
 
# Delete multiple columns from the dataframe
data = data.drop(["Y2001", "Y2002", "Y2003"], axis=1)

In [91]:
data.head()

Unnamed: 0,Area Abbreviation,Area Code,Item Code,Item,Element Code,Element,Unit,latitude,longitude,Y1961,...,Y2004,Y2005,Y2006,Y2007,Y2008,Y2009,Y2010,Y2011,Y2012,Y2013
0,AF,2,2531,Potatoes and products,5142,Food,1000 tonnes,33.94,67.71,111.0,...,276.0,294.0,294.0,260.0,242.0,250.0,192.0,169.0,196,230
1,AF,2,2536,Sugar cane,5521,Feed,1000 tonnes,33.94,67.71,45.0,...,50.0,29.0,61.0,65.0,54.0,114.0,83.0,83.0,69,81
2,AF,2,2537,Sugar beet,5521,Feed,1000 tonnes,33.94,67.71,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
3,AF,2,2542,Sugar (Raw Equivalent),5142,Food,1000 tonnes,33.94,67.71,45.0,...,124.0,152.0,169.0,192.0,217.0,231.0,240.0,240.0,250,255
4,AF,2,2543,"Sweeteners, Other",5142,Food,1000 tonnes,33.94,67.71,0.0,...,9.0,15.0,12.0,6.0,11.0,2.0,9.0,21.0,24,16


## Renaming columns
Column renames are achieved easily in Pandas using the DataFrame rename function. The rename function is easy to use, and quite flexible. Rename columns in these two ways:

- Rename by mapping old names to new names using a dictionary, with form {“old_column_name”: “new_column_name”, …}
- Rename by providing a function to change the column names with. Functions are applied to every column name.

In [92]:
# Rename columns using a dictionary to map values
# Rename the Area column to 'place_name'
data = data.rename(columns={"Area": "place_name"})

# Again, the inplace parameter will change the dataframe without assignment
data.rename(columns={"Area": "place_name"}, inplace=True)

# Rename multiple columns in one go with a larger dictionary
data.rename(
    columns={
        "Area": "place_name",
        "Y2001": "year_2001"
    },
    inplace=True
)

# Rename all columns using a function, e.g. convert all column names to lower case:
data.rename(columns=str.lower)

Unnamed: 0,area abbreviation,area code,item code,item,element code,element,unit,latitude,longitude,y1961,...,y2004,y2005,y2006,y2007,y2008,y2009,y2010,y2011,y2012,y2013
0,AF,2,2531,Potatoes and products,5142,Food,1000 tonnes,33.94,67.71,111.0,...,276.0,294.0,294.0,260.0,242.0,250.0,192.0,169.0,196,230
1,AF,2,2536,Sugar cane,5521,Feed,1000 tonnes,33.94,67.71,45.0,...,50.0,29.0,61.0,65.0,54.0,114.0,83.0,83.0,69,81
2,AF,2,2537,Sugar beet,5521,Feed,1000 tonnes,33.94,67.71,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
3,AF,2,2542,Sugar (Raw Equivalent),5142,Food,1000 tonnes,33.94,67.71,45.0,...,124.0,152.0,169.0,192.0,217.0,231.0,240.0,240.0,250,255
4,AF,2,2543,"Sweeteners, Other",5142,Food,1000 tonnes,33.94,67.71,0.0,...,9.0,15.0,12.0,6.0,11.0,2.0,9.0,21.0,24,16
5,AF,2,2745,Honey,5142,Food,1000 tonnes,33.94,67.71,2.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,2,2
6,AF,2,2549,"Pulses, Other and products",5521,Feed,1000 tonnes,33.94,67.71,1.0,...,3.0,2.0,3.0,3.0,3.0,5.0,4.0,5.0,4,4
7,AF,2,2549,"Pulses, Other and products",5142,Food,1000 tonnes,33.94,67.71,15.0,...,17.0,35.0,37.0,40.0,54.0,80.0,66.0,81.0,63,74
8,AF,2,2551,Nuts and products,5142,Food,1000 tonnes,33.94,67.71,2.0,...,11.0,13.0,24.0,34.0,42.0,28.0,66.0,71.0,70,44
9,AF,2,2560,Coconuts - Incl Copra,5142,Food,1000 tonnes,33.94,67.71,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0


In [93]:
# replace column name whitespace with underscore
data.rename(columns=lambda x: x.lower().replace(' ', '_'))

Unnamed: 0,area_abbreviation,area_code,item_code,item,element_code,element,unit,latitude,longitude,y1961,...,y2004,y2005,y2006,y2007,y2008,y2009,y2010,y2011,y2012,y2013
0,AF,2,2531,Potatoes and products,5142,Food,1000 tonnes,33.94,67.71,111.0,...,276.0,294.0,294.0,260.0,242.0,250.0,192.0,169.0,196,230
1,AF,2,2536,Sugar cane,5521,Feed,1000 tonnes,33.94,67.71,45.0,...,50.0,29.0,61.0,65.0,54.0,114.0,83.0,83.0,69,81
2,AF,2,2537,Sugar beet,5521,Feed,1000 tonnes,33.94,67.71,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
3,AF,2,2542,Sugar (Raw Equivalent),5142,Food,1000 tonnes,33.94,67.71,45.0,...,124.0,152.0,169.0,192.0,217.0,231.0,240.0,240.0,250,255
4,AF,2,2543,"Sweeteners, Other",5142,Food,1000 tonnes,33.94,67.71,0.0,...,9.0,15.0,12.0,6.0,11.0,2.0,9.0,21.0,24,16
5,AF,2,2745,Honey,5142,Food,1000 tonnes,33.94,67.71,2.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,2,2
6,AF,2,2549,"Pulses, Other and products",5521,Feed,1000 tonnes,33.94,67.71,1.0,...,3.0,2.0,3.0,3.0,3.0,5.0,4.0,5.0,4,4
7,AF,2,2549,"Pulses, Other and products",5142,Food,1000 tonnes,33.94,67.71,15.0,...,17.0,35.0,37.0,40.0,54.0,80.0,66.0,81.0,63,74
8,AF,2,2551,Nuts and products,5142,Food,1000 tonnes,33.94,67.71,2.0,...,11.0,13.0,24.0,34.0,42.0,28.0,66.0,71.0,70,44
9,AF,2,2560,Coconuts - Incl Copra,5142,Food,1000 tonnes,33.94,67.71,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0


## Exporting and Saving Pandas DataFrames
After manipulation or calculations, saving your data back to CSV is the next step. Data output in Pandas is as simple as loading data.

The two functions you’ll need to know are to_csv, which writes a DataFrame to a CSV file, and to_excel, which writes DataFrame information to a Microsoft Excel file.

In [98]:
# Output data to a CSV file
# Typically, I don't want row numbers in my output file, hence index=False.
# To avoid character issues, I typically use utf8 encoding for input/output.

data.to_csv("output_filename.csv", index=False, encoding='utf8')

# Output data to an Excel file.
# For the excel output to work, you may need to install the "xlsxwriter" package.

# data.to_excel("output_excel_file.xlsx", index=False)
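
To see what index=False does to the output, here is a small round-trip sketch. It writes to an in-memory buffer (io.StringIO) instead of a file, and the DataFrame values are made up:

```python
import io
import pandas as pd

df = pd.DataFrame({"place_name": ["Ireland", "France"], "Y2013": [120, 900]})

# Write CSV text to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the same text back in to confirm nothing was lost
buffer.seek(0)
restored = pd.read_csv(buffer)
```

With index=False the first line of the output is just the column names, with no extra leading column for row numbers.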

---

# Grouping and aggregation of data

As soon as you load data, you’ll want to group it by one value or another, and then run some calculations. There’s another post on this blog – Summarising, Aggregating, and Grouping Data in Python Pandas, that goes into extensive detail on this subject.



In [3]:
import pandas as pd
import dateutil.parser  # import the parser submodule explicitly
import numpy as np

# Load data from csv file
data = pd.read_csv('data/phone_data.csv')
# Convert date from string to date times
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)
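
As an alternative sketch, pandas can parse dates itself with pd.to_datetime. The raw strings below are hypothetical, since the exact format in phone_data.csv is not shown here; they assume a day-first layout:

```python
import pandas as pd

# Hypothetical raw date strings in day/month/year order
raw_dates = pd.Series(["15/10/2014 06:58", "01/11/2014 17:27"])

# dayfirst=True reads 01/11/2014 as 1 November, not 11 January
parsed = pd.to_datetime(raw_dates, dayfirst=True)
```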

In [4]:
data.head()

   index                 date  duration  item    month   network  network_type
0      0  2014-10-15 06:58:00    34.429  data  2014-11      data          data
1      1  2014-10-15 06:58:00    13.000  call  2014-11  Vodafone        mobile
2      2  2014-10-15 14:46:00    23.000  call  2014-11    Meteor        mobile
3      3  2014-10-15 14:48:00     4.000  call  2014-11     Tesco        mobile
4      4  2014-10-15 17:27:00     4.000  call  2014-11     Tesco        mobile


The main columns in the file are:

- date: The date and time of the entry
- duration: The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry.
- item: A description of the event occurring – can be one of call, sms, or data.
- month: The billing month that each entry belongs to – of form ‘YYYY-MM’.
- network: The mobile network that was called/texted for each entry.
- network_type: Whether the number being called was a mobile, international (‘world’), voicemail, landline, or other (‘special’) number.

## Summarising the DataFrame
Once the data has been loaded into Python, Pandas makes the calculation of different statistics very simple. For example, mean, max, min, standard deviations and more for columns are easily calculable:

In [5]:
# How many entries are there for each month?
data['month'].value_counts()

2014-11    230
2015-01    205
2014-12    157
2015-02    137
2015-03    101
Name: month, dtype: int64

In [6]:
# Number of non-null unique network entries
data['network'].nunique()

9

## Summarising Groups in the DataFrame
There’s further power put into your hands by mastering the Pandas “groupby()” functionality. Groupby essentially splits the data into different groups depending on a variable of your choice. For example, the expression  data.groupby('month')  will split our current DataFrame by month.

The groupby() function returns a GroupBy object, which essentially describes how the rows of the original data set have been split. The GroupBy object’s .groups attribute is a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group. For example:

In [7]:
data.groupby(['month']).groups.keys()

dict_keys(['2014-12', '2015-03', '2015-02', '2014-11', '2015-01'])

Functions like max(), min(), mean(), first(), last() can be quickly applied to the GroupBy object to obtain summary statistics for each group – an immensely useful function. This functionality is similar to the dplyr and plyr libraries for R. Different variables can be excluded / included from each summary requirement.

In [8]:
# Get the first entry for each month
data.groupby('month').first()

         index                 date  duration  item   network  network_type
month
2014-11      0  2014-10-15 06:58:00    34.429  data      data          data
2014-12    228  2014-11-13 06:58:00    34.429  data      data          data
2015-01    381  2014-12-13 06:58:00    34.429  data      data          data
2015-02    577  2015-01-13 06:58:00    34.429  data      data          data
2015-03    729  2015-02-12 20:15:00    69.000  call  landline      landline


In [9]:
# Get the sum of the durations per month
data.groupby('month')['duration'].sum()

month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

In [10]:
# Get the number of dates / entries in each month
data.groupby('month')['date'].count()

month
2014-11    230
2014-12    157
2015-01    205
2015-02    137
2015-03    101
Name: date, dtype: int64

In [11]:
# What is the sum of durations, for calls only, to each network
data[data['item'] == 'call'].groupby('network')['duration'].sum()

network
Meteor        7200.0
Tesco        13828.0
Three        36464.0
Vodafone     14621.0
landline     18433.0
voicemail     1775.0
Name: duration, dtype: float64

In [14]:
# How many calls, sms, and data entries are in each month?
data.groupby(['month', 'item'])['date'].count()

month    item
2014-11  call    107
         data     29
         sms      94
2014-12  call     79
         data     30
         sms      48
2015-01  call     88
         data     31
         sms      86
2015-02  call     67
         data     31
         sms      39
2015-03  call     47
         data     29
         sms      25
Name: date, dtype: int64

In [15]:
# How many calls, texts, and data are sent per month, split by network_type?
data.groupby(['month', 'network_type'])['date'].count()

month    network_type
2014-11  data             29
         landline          5
         mobile          189
         special           1
         voicemail         6
2014-12  data             30
         landline          7
         mobile          108
         voicemail         8
         world             4
2015-01  data             31
         landline         11
         mobile          160
         voicemail         3
2015-02  data             31
         landline          8
         mobile           90
         special           2
         voicemail         6
2015-03  data             29
         landline         11
         mobile           54
         voicemail         4
         world             3
Name: date, dtype: int64

## Groupby output format – Series or DataFrame?

The output from a groupby and aggregation operation varies between Pandas Series and Pandas Dataframes, which can be confusing for new users. 
- As a rule of thumb, if you calculate more than one column of results, your result will be a Dataframe. 
- For a single column of results, the agg function, by default, will produce a Series.

In [16]:
data.groupby('month')['duration'].sum() # produces Pandas Series

month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

In [17]:
data.groupby('month')[['duration']].sum() # Produces Pandas DataFrame

          duration
month
2014-11  26639.441
2014-12  14641.870
2015-01  18223.299
2015-02  15522.299
2015-03  22750.441


The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass “as_index=False” to the groupby operation.

In [18]:
data.groupby('month', as_index=False).agg({"duration": "sum"})

     month   duration
0  2014-11  26639.441
1  2014-12  14641.870
2  2015-01  18223.299
3  2015-02  15522.299
4  2015-03  22750.441


## Multiple Statistics per Group
The final piece of syntax that we’ll examine is the “agg()” function for Pandas. The aggregation functionality provided by the **agg()** function allows multiple statistics to be calculated per group in one calculation. The syntax is simple, and is similar to that of MongoDB’s aggregation framework.

### Applying a single function to columns in groups
Instructions for aggregation are provided in the form of a python dictionary or list. The dictionary keys are used to specify the columns upon which you’d like to perform operations, and the dictionary values to specify the function to run.

In [19]:
# Group the data frame by month and item and extract a number of stats from each group
data.groupby(['month', 'item']).agg({'duration': 'sum',       # find the sum of the durations for each group
                                     'network_type': 'count', # find the number of network type entries
                                     'date': 'first'})        # get the first date per group

                             date  network_type   duration
month   item
2014-11 call  2014-10-15 06:58:00           107  25547.000
2014-11 data  2014-10-15 06:58:00            29    998.441
2014-11 sms   2014-10-16 22:18:00            94     94.000
2014-12 call  2014-11-14 17:24:00            79  13561.000
2014-12 data  2014-11-13 06:58:00            30   1032.870
2014-12 sms   2014-11-14 17:28:00            48     48.000
2015-01 call  2014-12-15 20:03:00            88  17070.000
2015-01 data  2014-12-13 06:58:00            31   1067.299
2015-01 sms   2014-12-15 19:56:00            86     86.000
2015-02 call  2015-01-15 10:36:00            67  14416.000


The aggregation dictionary syntax is flexible and can be defined before the operation. You can also define functions inline using “lambda” functions to extract statistics that are not provided by the built-in options.
Skip for now
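
For reference, a minimal sketch of both ideas follows: the aggregation dictionary is defined before the groupby call, and one of its values is a lambda function. The DataFrame is made up (hypothetical values with column names borrowed from the phone data), and the "range" statistic is just an example of something without a built-in name:

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12", "2014-12"],
    "duration": [10.0, 30.0, 5.0, 7.0],
    "network": ["Vodafone", "Tesco", "Tesco", "Tesco"],
})

# Define the aggregation instructions ahead of time
aggregations = {
    "duration": lambda x: x.max() - x.min(),  # range of durations per group
    "network": "nunique",                     # number of distinct networks
}

result = df.groupby("month").agg(aggregations)
```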

### Applying multiple functions to columns in groups
To apply multiple functions to a single column in your grouped data, expand the syntax above to pass in a list of functions as the value in your aggregation dictionary. See below:

In [21]:
# Group the data frame by month and item and extract a number of stats from each group
data.groupby(['month', 'item']).agg({'duration': ['min', 'max', 'sum'],    # find the min, max, and sum of the duration column
                                     'network_type': 'count',              # find the number of network type entries
                                     'date': ['min', 'first', 'nunique']}) # get the min, first, and number of unique dates per group

              date                                               network_type  duration
                              min                first  nunique         count     min       max        sum
month   item
2014-11 call  2014-10-15 06:58:00  2014-10-15 06:58:00      104           107   1.000  1940.000  25547.000
2014-11 data  2014-10-15 06:58:00  2014-10-15 06:58:00       29            29  34.429    34.429    998.441
2014-11 sms   2014-10-16 22:18:00  2014-10-16 22:18:00       79            94   1.000     1.000     94.000
2014-12 call  2014-11-14 17:24:00  2014-11-14 17:24:00       76            79   2.000  2120.000  13561.000
2014-12 data  2014-11-13 06:58:00  2014-11-13 06:58:00       30            30  34.429    34.429   1032.870
2014-12 sms   2014-11-14 17:28:00  2014-11-14 17:28:00       41            48   1.000     1.000     48.000
2015-01 call  2014-12-15 20:03:00  2014-12-15 20:03:00       84            88   2.000  1859.000  17070.000
2015-01 data  2014-12-13 06:58:00  2014-12-13 06:58:00       31            31  34.429    34.429   1067.299
2015-01 sms   2014-12-15 19:56:00  2014-12-15 19:56:00       58            86   1.000     1.000     86.000
2015-02 call  2015-01-15 10:36:00  2015-01-15 10:36:00       67            67   1.000  1863.000  14416.000


### Renaming grouped statistics from groupby operations
When multiple statistics are calculated on columns, the resulting dataframe will have a multi-index set on the column axis. This can be difficult to work with, and I typically have to rename columns after a groupby operation.

One option is to drop the top level of the newly created multi-index on columns using .droplevel:

In [39]:
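grouped = data.groupby('month').agg({'duration': ['min', 'max', 'mean']})
# Drop the top level ('duration'), leaving only the function names
grouped.columns = grouped.columns.droplevel(level=0)
# Note: rename() returns a new DataFrame. The result is not assigned here,
# so `grouped` itself keeps the bare function names shown in the output.
grouped.rename(columns={"min": "min_duration", "max": "max_duration", "mean": "mean_duration"})
grouped.head()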

         min      max        mean
month
2014-11  1.0   1940.0  115.823657
2014-12  1.0   2120.0   93.260318
2015-01  1.0   1859.0   88.894141
2015-02  1.0   1863.0  113.301453
2015-03  1.0  10528.0  225.251891


However, this approach loses the original column names, leaving only the function names as column headers. A neater approach, as suggested to me by a reader, is using the ravel() method on the grouped columns. Ravel() turns a Pandas multi-index into a simpler array, which we can combine into sensible column names:

In [37]:
grouped = data.groupby('month').agg({'duration': ['min', 'max', 'mean']})
# Using ravel, and a string join, we can create better names for the columns:
grouped.columns = ["_".join(x) for x in grouped.columns.ravel()]

In [38]:
grouped.head()

         duration_min  duration_max  duration_mean
month
2014-11           1.0        1940.0     115.823657
2014-12           1.0        2120.0      93.260318
2015-01           1.0        1859.0      88.894141
2015-02           1.0        1863.0     113.301453
2015-03           1.0       10528.0     225.251891
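
Two further options may be worth knowing, sketched here on a made-up DataFrame (hypothetical values). The first flattens the multi-index with to_flat_index() instead of ravel(); the second uses "named aggregation", which sets the output column names directly in the agg() call. This assumes a reasonably recent pandas (named aggregation was added in version 0.25):

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12", "2014-12"],
    "duration": [10.0, 30.0, 5.0, 7.0],
})

# Option 1: flatten the column multi-index into tuples, then join with "_"
flat = df.groupby("month").agg({"duration": ["min", "max", "mean"]})
flat.columns = ["_".join(col) for col in flat.columns.to_flat_index()]

# Option 2: named aggregation, new_name=(source_column, function)
named = df.groupby("month").agg(
    min_duration=("duration", "min"),
    max_duration=("duration", "max"),
    mean_duration=("duration", "mean"),
)
```

With named aggregation there is no multi-index to clean up afterwards, which avoids the renaming step altogether.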
