# Tutorial 3.4: Pandas Aggregation Data Methods
Python for Data Analytics | Module 3  
Professor James Ng

In [1]:
# SETUP: DO NOT CHANGE
import numpy as np
import pandas as pd

In [2]:
# Optional Adjustments to Float Display
pd.options.display.float_format = '{:,.2f}'.format

## Introduction

In this tutorial, we will explore how to perform simple aggregations on Pandas **`Series`** and **`DataFrame`** objects. Per the common theme in our *pandas* coverage thus far, you will see a lot of overlap between these operations and the aggregation functions we covered in *NumPy*.

In [3]:
# We'll be using our college scorecard dataset in this tutorial.
!curl -L https://osf.io/cz253/download --create-dirs -o data-sets/college-scorecard-data-scrubbed.csv

college_scorecard = pd.read_csv(
    'data-sets/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')
college_scorecard.head()



Unnamed: 0,UNITID,OPEID,OPEID6,institution_name,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
0,102580,884300,8843,Alaska Bible College,Palmer,AK,www.akbible.edu/,3,Bachelors,2,...,0.36,0.33,,,,0.29,,PrivacySuppressed,,
1,103501,2541000,25410,Alaska Career College,Anchorage,AK,www.alaskacareercollege.edu,1,Certificate,3,...,0.71,,0.79,,,0.79,28700.0,8994,0.707589494,
2,442523,4138600,41386,Alaska Christian College,Soldotna,AK,www.alaskacc.edu,1,Certificate,2,...,0.89,,0.47,,1.0,0.68,,PrivacySuppressed,0.0,
3,102669,106100,1061,Alaska Pacific University,Anchorage,AK,www.alaskapacific.edu,3,Bachelors,2,...,0.32,0.77,,1.0,,0.53,47000.0,23250,,0.514833663
4,102711,3160300,31603,AVTEC-Alaska's Institute of Technology,Seward,AK,www.avtec.edu/,1,Certificate,1,...,0.07,,1.0,,1.0,0.07,33500.0,PrivacySuppressed,0.846055789,


### Table of Common Aggregation Methods
The following table contains a list of common aggregation methods that are available on *pandas* objects. This list is not exhaustive and, as you grow in your *pandas* knowledge, you'll discover additional methods on your own.

| Method Name | Description |
|-------------|-------------|
| `count`     | Number of non-NaN values |
| `min`       | Minimum value |
| `max`       | Maximum value |
| `sum`       | Sum of values |
| `mean`      | Mean of values |
| `median`    | Median value |
| `std`       | Standard deviation |
| `quantile`  | Value at a given quantile (e.g. `0.25` for the 25th percentile) |
| `cumsum`    | Cumulative (running) sum of values |
| `cummin`    | Cumulative (running) minimum |
| `cummax`    | Cumulative (running) maximum |
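
As a quick illustration of the less familiar methods in this list, here is a minimal sketch on a made-up five-element *Series* (the values and the variable names are invented for this example and are not part of the college scorecard data). Note that the cumulative methods return one value per element rather than a single number.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.Series([3, 1, 4, 1, 5])

running_total = toy.cumsum()      # running sum: 3, 4, 8, 9, 14
running_min = toy.cummin()        # smallest value seen so far: 3, 1, 1, 1, 1
running_max = toy.cummax()        # largest value seen so far: 3, 3, 4, 4, 5
middle_value = toy.quantile(0.5)  # the 0.5 quantile is the median: 3.0

print(running_total.tolist())
print(running_min.tolist())
print(running_max.tolist())
print(middle_value)
```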

## Aggregations with `Series` Objects
For most, if not all, practical purposes, performing aggregations on **`Series`** objects is no different than performing the same operation on a NumPy array.

Let us perform some simple aggregations on the SAT Average series of our `college_scorecard` *DataFrame* to demonstrate.

In [None]:
# Get the SAT Average Series
sat_averages = college_scorecard['sat_average']

In [None]:
# Get the count of non NaN values in the Series
sat_averages.count()

In [None]:
# Mean/Average of SAT Averages - Kinda Funny
sat_averages.mean()

In [None]:
# Max SAT Average in Series
sat_averages.max()

In [None]:
# Min SAT Average in Series
sat_averages.min()

In [None]:
# Median SAT Average
sat_averages.median()

In [None]:
# Standard Deviation
sat_averages.std()

In [None]:
# Quantile - this is just like np.percentile except you
# specify the desired percentiles as a fraction of 1.
sat_averages.quantile([.25, .50, .75]) # Get the 25th, 50th, and 75th percentiles
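
To make the comparison with `np.percentile` concrete, here is a small sketch on made-up scores (the `scores` values are hypothetical, chosen so the arithmetic is easy to check by hand). The only difference in the call is the scale: fractions of 1 for `quantile`, values out of 100 for `np.percentile`.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, invented for illustration only.
scores = pd.Series([900.0, 1000.0, 1100.0, 1200.0, 1300.0])

# pandas: quantiles are given as fractions of 1
quartiles = scores.quantile([0.25, 0.50, 0.75])

# NumPy: the same cut points are given as percentages
percentiles = np.percentile(scores.to_numpy(), [25, 50, 75])

print(quartiles.tolist())    # [1000.0, 1100.0, 1200.0]
print(percentiles.tolist())  # [1000.0, 1100.0, 1200.0]
```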

A couple of things to note here. Back in NumPy, if you had an array with NaN values in it, those NaN values would have prevented you from getting back anything helpful from the equivalent aggregation functions:

In [None]:
# One NaN value in a NumPy array means that you'll get NaN as the 
# result of aggregation functions. NOT COOL.
example_array = np.array([1, 2, np.nan])  # np.nan is the supported spelling; np.NaN was removed in NumPy 2.0
np.max(example_array)

The `sat_averages` *Series* contains many `NaN` values, and yet *pandas* gives us a helpful result when we invoke its aggregation methods. *pandas* performs whatever calculation we asked for with whatever data is available, and simply skips over NaN values.
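
The difference is easy to see side by side. The sketch below uses a made-up three-element list (not the scorecard data) and also shows `np.nanmax`, NumPy's NaN-ignoring variant of `np.max`, which behaves the way the *pandas* method does by default.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented for illustration only.
values = [1.0, 2.0, np.nan]

numpy_max = np.max(np.array(values))        # nan: the NaN propagates
numpy_nanmax = np.nanmax(np.array(values))  # 2.0: NaN-ignoring NumPy variant
pandas_max = pd.Series(values).max()        # 2.0: pandas skips NaN by default
pandas_count = pd.Series(values).count()    # 2: only non-NaN values are counted

print(numpy_max, numpy_nanmax, pandas_max, pandas_count)
```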

You can, however, turn this behavior off by passing `skipna=False`. This doesn't make much sense when operating on an individual *Series* object, but you might want to use it when dealing with a *DataFrame*.

In [None]:
# You can make pandas act like NumPy and barf on NaN values if you want.
sat_averages.mean(skipna=False)
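
One situation where `skipna=False` can be useful on a *DataFrame* is spotting which columns have gaps: any column containing a NaN comes back as NaN. This is a minimal sketch with a made-up two-column frame (the column names `complete` and `has_gap` are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, invented for illustration only.
toy = pd.DataFrame({
    'complete': [1.0, 2.0, 3.0],
    'has_gap': [1.0, np.nan, 3.0],
})

lenient_means = toy.mean()            # default: NaN values are skipped
strict_means = toy.mean(skipna=False)  # any NaN in a column makes its result NaN

print(lenient_means)
print(strict_means)
```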

## Aggregations with `DataFrame` Objects
When you invoke one of the aggregation methods on a *DataFrame* object, *pandas* will by default attempt to perform the requested aggregation on a per column basis.

Let's demonstrate with a couple of the methods.

In [None]:
# Get the count of valid (not NaN) entries in 
# all the columns of the dataset.

# As an aside, this is a quick way of identifying
# which columns have a lot of NaN values.
college_scorecard.count()

In [None]:
# With many columns, the printout from count() is truncated. To see them all, use info()
college_scorecard.info()

In [None]:
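# Get the mean of the first ten numeric columns.
# numeric_only=True skips text columns, which pandas 2.0+ would otherwise reject with a TypeError.
college_scorecard.mean(numeric_only=True)[0:10]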

If you want, you can override the default behavior and have *pandas* aggregate across the columns instead, producing one result per row, by specifying the `axis` parameter with a value of `1`.

In most cases you won't find this helpful unless you've transposed your data so that the columns/rows have switched places.

That said, the `count()` method can still be of some value here:

In [None]:
# For each row, how many columns have valid (non-NaN) values?
# In other words, how many valid values does EACH ROW have?

# Here we will display the results for the first
# 10 rows
college_scorecard.count(axis=1)[:10]
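
To see the two axes next to each other, here is a small sketch on a made-up three-by-three frame (column names `a`, `b`, `c` and all values are hypothetical). The default gives one result per column; `axis=1` gives one result per row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, invented for illustration only.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [10.0, 20.0, np.nan],
    'c': [100.0, 200.0, 300.0],
})

per_column = toy.sum()               # default axis=0: one sum per column
per_row = toy.sum(axis=1)            # axis=1: one sum per row (NaN skipped)
valid_per_row = toy.count(axis=1)    # non-NaN values in each row

print(per_column.tolist())     # [4.0, 30.0, 600.0]
print(per_row.tolist())        # [111.0, 220.0, 303.0]
print(valid_per_row.tolist())  # [3, 2, 2]
```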

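Now, for the sake of completeness, I'll quickly demonstrate the remaining aggregation methods. Take note that some of these methods only make sense for numeric columns, while others also return data for non-numeric columns. In *pandas* 2.0 and later, the numeric-only methods raise a `TypeError` when the *DataFrame* contains text columns unless you pass `numeric_only=True`.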

In [None]:
# mean()
# Numeric Only (pass numeric_only=True to skip text columns)
college_scorecard.mean(numeric_only=True).head()

In [None]:
# sum()
# Numeric and Non-Numeric 
# Has the strange effect of concatenating string Series values together into 
# really really reallllllly long strings.
college_scorecard.sum().head()

In [None]:
# You can specify to only return data on numeric fields
# on this and other methods that process non-numeric data.
college_scorecard.sum(numeric_only=True)[:10]
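
Here is a minimal sketch of `numeric_only=True` on a made-up frame that mixes a text column with two numeric ones (the column names `name`, `enrollment`, and `rate` are hypothetical). The text column is simply left out of the result.

```python
import pandas as pd

# Hypothetical toy frame, invented for illustration only.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'enrollment': [100, 200, 300],
    'rate': [0.5, 0.25, 0.75],
})

numeric_sums = toy.sum(numeric_only=True)    # 'name' is excluded
numeric_means = toy.mean(numeric_only=True)  # 'name' is excluded

print(numeric_sums)
print(numeric_means)
```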

In [None]:
# min()
# Numeric and Non-Numeric 
college_scorecard.min()[:10]

In [None]:
# max()
# Numeric and Non-Numeric 
college_scorecard.max()[:10]

In [None]:
# std()
# Numeric Only (pass numeric_only=True to skip text columns)
college_scorecard.std(numeric_only=True)[:10]

In [None]:
# quantile()
# Numeric Only (pass numeric_only=True to skip text columns)

# This one is interesting in that if you pass multiple values, *pandas*
# returns a DataFrame object with the requested data.
college_scorecard.quantile([.25, .75], numeric_only=True)

In [None]:
# If you pass only one value, you just get a Series back.
college_scorecard.quantile(.50, numeric_only=True)[:10]
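
The difference in return type is easier to inspect on a tiny frame. This sketch uses made-up columns `x` and `y` (hypothetical names and values): a list of quantiles yields a *DataFrame* with one row per quantile, while a single quantile yields a *Series* with one entry per column.

```python
import pandas as pd

# Hypothetical toy frame, invented for illustration only.
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [10.0, 20.0, 30.0, 40.0, 50.0],
})

several = toy.quantile([0.25, 0.75])  # DataFrame: rows are the quantiles
single = toy.quantile(0.5)            # Series: index is the column names

print(type(several).__name__, several.loc[0.25, 'x'], several.loc[0.75, 'y'])
print(type(single).__name__, single['x'], single['y'])
```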

## The `describe()` method
When you are doing initial exploratory analysis of a dataset, the `describe()` method can be very handy. It is available on both **`Series`** and **`DataFrame`** objects and outputs a variety of aggregations that are very useful in getting a general "sense" of a dataset.

Take a look at the output for our **`sat_average`** series and **`college_scorecard`** dataframe.

In [None]:
sat_averages.describe()

In [None]:
college_scorecard.describe()

### Tweaking `describe()` behavior with the `include` and `exclude` parameters
When used on a **`DataFrame`** object, the default behavior of the **`describe()`** method is to provide statistics on numeric columns only.

Let's take a look at the **`dtypes`** attribute on our `college_scorecard` dataframe to see which columns this does and doesn't include.

In [None]:
college_scorecard.dtypes

See all the places where it lists the datatype of a column as 'object'? These columns won't be reported on with the **`describe()`** method when using the default parameters.

We can change this using either the **`include`** or the **`exclude`** parameters:

In [None]:
# Include the object datatype columns.
# The np.object alias was removed in NumPy 1.24; the built-in object works the same way.
college_scorecard.describe(include=[object])

In [None]:
# Exclude the numeric datatypes
college_scorecard.describe(exclude=[np.number])

There are two things here that are important to notice:
1. The type of statistics returned changed when operating on **`object`** column types.
2. I specified what to include and exclude by data type, using the NumPy type `np.number` to stand for every numeric column. While you could do it other ways, this is the recommendation of *pandas* itself.

Finally, you can specify **`include='all'`** to force *pandas* to evaluate all columns. It will inject `NaN` where a calculation cannot be done.

In [None]:
college_scorecard.describe(include='all')
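
Because the full scorecard output is wide, it can help to try the three variations on a tiny frame first. The sketch below uses a made-up two-column frame (the values are hypothetical; the column names merely echo the dataset) and checks which statistics appear in each case.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, invented for illustration only.
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AL'],
    'sat_average': [1000.0, 1100.0, 1200.0],
})

numeric_summary = toy.describe()                   # numeric columns only
text_summary = toy.describe(exclude=[np.number])   # count, unique, top, freq
full_summary = toy.describe(include='all')         # both sets, NaN where not applicable

print(list(numeric_summary.columns))          # ['sat_average']
print(list(text_summary.index))               # ['count', 'unique', 'top', 'freq']
print(text_summary.loc['top', 'state'])       # AK
print(full_summary.loc['mean', 'state'])      # nan: a mean is undefined for text
```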