# Tutorial 2.6: NumPy Aggregation Functions
Python for Data Analytics | Module 2  
Professor James Ng

In this tutorial, we are going to cover *`NumPy`* functions that reduce (or aggregate) all the values in an array and spit out one result. These functions allow us to come up with summary statistics for a given data set.

Plus, in my opinion, it is where things start to get fun in our data analysis. So, I hope you enjoy!

In [None]:
# Import NumPy as always...
# Along with Pandas for loading our data
import numpy as np
import pandas as pd

In [None]:
# Create a directory to hold our data sets and download them
!mkdir -p data-sets
!wget --show-progress -O data-sets/chicago-employees.csv https://osf.io/8svw4/download

# Load the Chicago government employees data set.
chicago_employees = pd.read_csv('data-sets/chicago-employees.csv')
chicago_employees.head()

In [None]:
# I'll go ahead and create a NumPy `ndarray` for each column
# of our data set, even though we might not use them all.
employee_names = np.array(chicago_employees['Name'])
employee_titles = np.array(chicago_employees['Job Titles'])
employee_departments = np.array(chicago_employees['Department'])
employee_full_or_part_time = np.array(chicago_employees['Full or Part-Time'])
employee_salary_or_hourly = np.array(chicago_employees['Salary or Hourly'])

# These arrays have different index values because the NaN values have
# been filtered out.  Because of that, you can't use them in coordination
# with other arrays. We will revisit this in Pandas.
employee_typical_hours = np.array(
    chicago_employees['Typical Hours'][chicago_employees['Typical Hours'].notnull()])

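# Strip the "$" and "," characters before converting the text to numbers.
# `regex` is passed explicitly because its default differs across pandas versions.
_ = chicago_employees['Annual Salary'][chicago_employees['Annual Salary'].notnull()]
employee_salaries = np.array(pd.to_numeric(_.str.replace(r'[$,]', '', regex=True)))

_ = chicago_employees['Hourly Rate'][chicago_employees['Hourly Rate'].notnull()]
employee_hourly_rate = np.array(pd.to_numeric(_.str.replace('$', '', regex=False)))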

## `np.sum()`
We've actually seen `np.sum()` once before. It came up during our coverage of comparison functions, where we used it to count how many `True` values exist in NumPy arrays.

The reason it worked there is because `True` objects have a numeric equivalent value of 1.
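
As a quick refresher, here is a minimal sketch of that counting trick. The `sample_rates` values are made up for illustration and are not taken from the Chicago data set.

```python
import numpy as np

# Hypothetical hourly rates, made up purely for illustration.
sample_rates = np.array([12.5, 40.0, 18.75, 55.0, 9.0])

# The comparison produces a boolean array, one True/False per element.
above_15_mask = sample_rates > 15
print(above_15_mask)

# Each True counts as 1 and each False as 0, so the sum is a count.
count_above_15 = np.sum(above_15_mask)
print(count_above_15)  # 3
```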

Now we come back to `np.sum()` to see its true power: adding up all the numeric objects in an array.

In [None]:
# What is the total salary of the first 10 salary records?
np.sum(employee_salaries[:10])

The function also works with multidimensional arrays.

In [None]:
# Create a 10x8 grid of the first 80 salary records
salaries_10_by_8_grid = employee_salaries[:80].reshape(10, 8)
print(salaries_10_by_8_grid)

In [None]:
# `np.sum()` gets the total of all values in all dimensions 
# of the multi-dimensional array.
np.sum(salaries_10_by_8_grid)

### No Love for Standard Python `sum()`, `min()`, and `max()`
Python has a standard `sum()` function, but I haven't demonstrated it here. The same is true for the standard `min()` & `max()` functions, whose NumPy counterparts we will look at next.

The reason I didn't demonstrate them is that you will almost always want to use the NumPy versions of these functions. Remember that the standard Python versions won't have the speed advantages of the NumPy ones and they do not always support multi-dimensional arrays.

So, unless you've got a good reason to do so, use the NumPy versions of aggregation functions.
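
To make the multi-dimensional point concrete, here is a small sketch using a made-up 2x3 array (the name `grid` is just for this example). It shows one way the built-in `sum()` can surprise you.

```python
import numpy as np

# A made-up 2x3 array for illustration.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# NumPy's version adds up every element in every dimension.
numpy_total = np.sum(grid)
print(numpy_total)     # 21

# The built-in version iterates over the first dimension only, so it
# adds the two rows together element by element instead of returning
# a single number.
builtin_result = sum(grid)
print(builtin_result)  # [5 7 9]
```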

### Special Note: Aggregating by Dimension in Multidimensional Arrays
*Fair warning, this is going to make your head hurt.*

Another feature offered by NumPy aggregation functions when dealing with multidimensional arrays is the ability to spit out calculations not just for the array as a whole, but for a given dimension.

You do this by specifying an `axis` parameter when invoking the function. Let's go through some examples to demonstrate.

In [None]:
# Take a slice out of our `salaries_10_by_8_grid` grid so that
# we have something more manageable to work with here.
salaries_4_by_5_grid = salaries_10_by_8_grid[0:4, 0:5]
print(salaries_4_by_5_grid)

And now let's say that we wanted to get the sum of each row or each column.

In [None]:
# Calculate the total salary for each column
np.sum(salaries_4_by_5_grid, axis=0)

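Note:  
`axis=0` means "collapse the rows": the operation runs vertically, down each column, so you get one result per column.  
`axis=1` means "collapse the columns": the operation runs horizontally, across each row, so you get one result per row.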
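
If the axis numbering still feels backwards, a tiny made-up grid can help. This is only a sketch with invented numbers (`toy_grid` is not part of our data set), chosen so the row and column totals are easy to tell apart.

```python
import numpy as np

# A made-up grid with 2 rows and 3 columns.
toy_grid = np.array([[1, 2, 3],
                     [10, 20, 30]])

# axis=0 collapses the rows, leaving one total per column.
column_totals = np.sum(toy_grid, axis=0)
print(column_totals)  # [11 22 33]

# axis=1 collapses the columns, leaving one total per row.
row_totals = np.sum(toy_grid, axis=1)
print(row_totals)     # [ 6 60]

# The axis you name is the one that disappears from the shape.
print(toy_grid.shape, column_totals.shape, row_totals.shape)  # (2, 3) (3,) (2,)
```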

In [None]:
# Calculate the total salary for each row
np.sum(salaries_4_by_5_grid, axis=1)

## `np.min()` & `np.max()`
These functions are pretty self-explanatory: 
* `np.min()` gives you the minimum value in an array (or specified array dimension). 
* `np.max()` gives you the maximum value in an array (or specified array dimension). 
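
Before we try them on the real data, here is a minimal sketch on a made-up grid (`toy_grid` is an invented name and the numbers are arbitrary), showing both the whole-array form and the `axis` form.

```python
import numpy as np

# A made-up grid with 2 rows and 3 columns.
toy_grid = np.array([[7, 2, 9],
                     [4, 8, 1]])

# Without `axis`, every element in every dimension is considered.
overall_min = np.min(toy_grid)
overall_max = np.max(toy_grid)
print(overall_min, overall_max)  # 1 9

# With `axis=0` we get the maximum of each column.
column_max = np.max(toy_grid, axis=0)
print(column_max)                # [7 8 9]

# With `axis=1` we get the minimum of each row.
row_min = np.min(toy_grid, axis=1)
print(row_min)                   # [2 1]
```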

In [None]:
# Grab the min and max hourly rates out of `employee_hourly_rate`
print(np.min(employee_hourly_rate))  
print(np.max(employee_hourly_rate))

In [None]:
# Now let's try it with our `salaries_4_by_5_grid`
np.min(salaries_4_by_5_grid), np.max(salaries_4_by_5_grid)

In [None]:
# What is the maximum salary in each column
# of `salaries_4_by_5_grid`?
np.max(salaries_4_by_5_grid, axis=0)

In [None]:
# What is the minimum salary in each row
# of `salaries_4_by_5_grid`?
np.min(salaries_4_by_5_grid, axis=1)

## `np.mean()` & `np.median()`
Before we demonstrate these functions, let's briefly review the difference between **mean** and **median** values.

* The **mean** value of an array is the average value of all elements.
* The **median** value is the middle value of the sorted array: half of the elements are less than it and half are greater.
    * Special Note: If there is an even number of elements in an array, the median will be the mean (average) of the middle two elements.
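
Here is a small sketch of the difference, using made-up hourly rates (not from our data set). One extreme value pulls the mean up a lot but leaves the median alone, and the second array shows the even-length rule from the special note.

```python
import numpy as np

# Hypothetical hourly rates with one extreme value at the end.
rates = np.array([10.0, 12.0, 14.0, 16.0, 98.0])

mean_rate = np.mean(rates)
median_rate = np.median(rates)
print(mean_rate)     # 30.0 (pulled up by the 98.0)
print(median_rate)   # 14.0 (the middle value)

# With an even number of elements, the median is the mean of the
# two middle values: (12.0 + 14.0) / 2
even_rates = np.array([10.0, 12.0, 14.0, 16.0])
even_median = np.median(even_rates)
print(even_median)   # 13.0
```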

In [None]:
# Let's calculate the mean of our `employee_hourly_rate`
np.mean(employee_hourly_rate)

In [None]:
# Now let's calculate the median.
np.median(employee_hourly_rate)

## `np.std()`
Now let's talk about the **standard deviation** function. If you need a quick refresher on how this value is calculated, as well as the related **variance** number, <a href="http://www.mathsisfun.com/data/standard-deviation.html" target="_blank">take a look at this website.</a>

Here are the quick definitions:
* *Variance*: The average of the squared differences from the mean.
* *Standard Deviation*: The square root of the variance.

Essentially, we use these functions to determine how spread out the values in a given array are and measure the relative position of a given element.
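
To connect the two definitions to the functions, here is a minimal sketch that works through them by hand on a small made-up array and then compares against `np.var()` and `np.std()`. One thing to be aware of: by default these NumPy functions divide by the number of elements (the "population" version, which matches the definition above). If you want the "sample" version that divides by one less, pass `ddof=1`.

```python
import numpy as np

# A small made-up array for illustration.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Step 1: the mean.
mean_value = np.mean(values)                       # 5.0

# Step 2: the squared differences from the mean.
squared_differences = (values - mean_value) ** 2

# Step 3: the variance is the average of those squared differences.
variance = np.mean(squared_differences)            # 4.0

# Step 4: the standard deviation is the square root of the variance.
standard_deviation = np.sqrt(variance)             # 2.0

print(variance, standard_deviation)
print(np.var(values), np.std(values))              # same two numbers

# The "sample" standard deviation divides by (n - 1) instead of n.
sample_std = np.std(values, ddof=1)
print(sample_std)
```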

In [None]:
# Let's calculate the standard deviation of `employee_salaries`
# As well as the standard deviation of `employee_hourly_rate`
np.std(employee_salaries), np.std(employee_hourly_rate)

In [None]:
# And now let's get the mean for both of these data sets.
np.mean(employee_salaries), np.mean(employee_hourly_rate)

So, with these 4 pieces of data we have all we need to uncover particularly high/low salaries and/or hourly pay rates of Chicago employees.

In [None]:
high_salaries_mask = employee_salaries > (np.mean(employee_salaries) + np.std(employee_salaries))
low_hourly_mask = employee_hourly_rate < (np.mean(employee_hourly_rate) - np.std(employee_hourly_rate))

In [None]:
# Which annual salaries are more than one standard deviation above the mean?
employee_salaries[high_salaries_mask]

In [None]:
# Which hourly wages are more than one standard deviation below the mean?
employee_hourly_rate[low_hourly_mask]

## `np.percentile()`
Finally, let's take a look at calculating percentiles. 

Remember, that a *percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found.*

Thank you Wikipedia for your concise definition.
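
To see that definition in action, here is a small sketch on made-up scores from 1 to 100 (the `scores` array is invented for this example). Note that `np.percentile()` interpolates between neighboring values by default, so the result does not have to be a value that actually appears in the array.

```python
import numpy as np

# Made-up scores: the whole numbers 1 through 100.
scores = np.arange(1, 101)

# The 20th percentile lands between 20 and 21 because of interpolation.
p20 = np.percentile(scores, 20)
print(p20)

# Check the definition: what share of the scores fall below that value?
share_below = np.mean(scores < p20)
print(share_below)  # 0.2
```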

In [None]:
# What's the 25th percentile of `employee_hourly_rate`
np.percentile(employee_hourly_rate, 25)

In [None]:
# What about the 75th percentile?
np.percentile(employee_hourly_rate, 75)

You can also use this function on multi-dimensional arrays. Let's go back to one of our sample salary grid arrays to demonstrate.

In [None]:
# Here is what this looks like again as a reminder
salaries_4_by_5_grid

In [None]:
# You can get the percentile when considering ALL elements in ALL dimensions.
np.percentile(salaries_4_by_5_grid, 65)

In [None]:
# Or you can use the `axis` parameter like in many other functions
# to calculate the percentiles for rows or columns

# What is the 50th percentile for each column?
# i.e. Across the 0th (row) dimension/axis
np.percentile(salaries_4_by_5_grid, 50, axis=0)

In [None]:
# What is the 90th percentile for each row?
# i.e. Across the 1st (column) dimension/axis
np.percentile(salaries_4_by_5_grid, 90, axis=1)

Finally, you can pass a list of percentiles to a single call of `np.percentile()` if you want to calculate many percentiles at once. A handy trick.

In [None]:
# What are the 25th, 50th, and 75th percentiles of `employee_salaries`?
np.percentile(employee_salaries, [25, 50, 75])