# Grouping and Aggregating Data

## Outline
* Splitting data into groups
* Aggregation


In [1]:
import pandas as pd
df = pd.read_csv('./data/nba.csv')

df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0



## Splitting data into groups

The pandas `DataFrame.groupby()` method splits the data into groups based on some criteria, such as the values in one or more columns.

In [2]:
# Use groupby() to group the data by the "Team" column.

team = df.groupby("Team")
# Show the first value in each group
team.first()

Team,Name,Number,Position,Age,Height,Weight,College,Salary
Atlanta Hawks,Kent Bazemore,24.0,SF,26.0,6-5,201.0,Old Dominion,2000000.0
Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
Brooklyn Nets,Bojan Bogdanovic,44.0,SG,27.0,6-8,216.0,Oklahoma State,3425510.0
Charlotte Hornets,Nicolas Batum,5.0,SG,27.0,6-8,200.0,Virginia Commonwealth,13125306.0
Chicago Bulls,Cameron Bairstow,41.0,PF,25.0,6-9,250.0,New Mexico,845059.0
Cleveland Cavaliers,Matthew Dellavedova,8.0,PG,25.0,6-4,198.0,Saint Mary's,1147276.0
Dallas Mavericks,Justin Anderson,1.0,SG,22.0,6-6,228.0,Virginia,1449000.0
Denver Nuggets,Darrell Arthur,0.0,PF,28.0,6-9,235.0,Kansas,2814000.0
Detroit Pistons,Joel Anthony,50.0,C,33.0,6-9,245.0,UNLV,2500000.0
Golden State Warriors,Leandro Barbosa,19.0,SG,33.0,6-3,194.0,North Carolina,2500000.0
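
To see what the split produces without loading the full dataset, the sketch below runs the same calls on a small hand-built frame. The name `toy` and the choice of five rows (copied from outputs shown in this notebook) are assumptions made for illustration only. Note that `first()` returns the first non-null value in each column, so when data is missing a result row can combine values from different rows of the group.

```python
import pandas as pd

# Hypothetical toy frame; the rows are copied from outputs shown in this notebook
toy = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "R.J. Hunter",
             "Kent Bazemore", "Al Horford"],
    "Team": ["Boston Celtics", "Boston Celtics", "Boston Celtics",
             "Atlanta Hawks", "Atlanta Hawks"],
    "Age": [25.0, 25.0, 22.0, 26.0, 30.0],
})

grouped = toy.groupby("Team")

n_groups = grouped.ngroups    # number of distinct teams
sizes = grouped.size()        # number of rows in each team
first_rows = grouped.first()  # first non-null value per column, per team

print(n_groups)
print(sizes)
print(first_rows)
```

By default the group keys are sorted, which is why the teams appear in alphabetical order in the result.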


Let’s look at the rows contained in a single group. The `get_group()` method takes the name of a group, here the name of a team, and returns the entries that belong to it.


In [3]:
# Finding the values contained in the "Boston Celtics" group 
team.get_group("Boston Celtics")

Unnamed: 0,Name,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,8.0,PF,29.0,6-10,231.0,,5000000.0
5,Amir Johnson,90.0,PF,29.0,6-9,240.0,,12000000.0
6,Jordan Mickey,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0
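
Besides `get_group()`, a `groupby` object can be iterated over, yielding one (name, sub-frame) pair per group. The following is a minimal sketch on a hypothetical toy frame (`toy` and its rows are illustrative, not the full dataset):

```python
import pandas as pd

# Hypothetical toy frame; the rows are copied from outputs shown in this notebook
toy = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "R.J. Hunter",
             "Kent Bazemore", "Al Horford"],
    "Team": ["Boston Celtics", "Boston Celtics", "Boston Celtics",
             "Atlanta Hawks", "Atlanta Hawks"],
    "Age": [25.0, 25.0, 22.0, 26.0, 30.0],
})

grouped = toy.groupby("Team")

# Pull out one group by its key
celtics = grouped.get_group("Boston Celtics")
print(celtics)

# Iterate over every group: each step yields the key and the matching rows
counts = {}
for team_name, rows in grouped:
    counts[team_name] = len(rows)
print(counts)
```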


We can also use the `groupby()` function to form groups based on more than one category, i.e. use more than one column to perform the splitting.



In [4]:
# First grouping based on "Team" 
# Within each team we are grouping based on "Position" 

team_position = df.groupby(["Team", "Position"])

In [5]:
# Print the first value in each group 

team_position.first()

Team,Position,Name,Number,Age,Height,Weight,College,Salary
Atlanta Hawks,C,Al Horford,15.0,30.0,6-10,245.0,Florida,12000000.0
Atlanta Hawks,PF,Kris Humphries,43.0,31.0,6-9,235.0,Minnesota,1000000.0
Atlanta Hawks,PG,Dennis Schroder,17.0,22.0,6-1,172.0,Wake Forest,1763400.0
Atlanta Hawks,SF,Kent Bazemore,24.0,26.0,6-5,201.0,Old Dominion,2000000.0
Atlanta Hawks,SG,Tim Hardaway Jr.,10.0,24.0,6-6,205.0,Michigan,1304520.0
...,...,...,...,...,...,...,...,...
Washington Wizards,C,Marcin Gortat,13.0,32.0,6-11,240.0,North Carolina State,11217391.0
Washington Wizards,PF,Drew Gooden,90.0,34.0,6-10,250.0,Kansas,3300000.0
Washington Wizards,PG,Ramon Sessions,7.0,30.0,6-3,190.0,Nevada,2170465.0
Washington Wizards,SF,Jared Dudley,1.0,30.0,6-7,225.0,Boston College,4375000.0
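
When grouping by several columns, each group is identified by a tuple of values, one per grouping column. The sketch below illustrates this on a hypothetical toy frame (`toy` and its rows are assumptions for illustration; the rows are copied from outputs in this notebook):

```python
import pandas as pd

# Hypothetical toy frame; the rows are copied from outputs shown in this notebook
toy = pd.DataFrame({
    "Name": ["Avery Bradley", "John Holland", "R.J. Hunter",
             "Kent Bazemore", "Al Horford"],
    "Team": ["Boston Celtics", "Boston Celtics", "Boston Celtics",
             "Atlanta Hawks", "Atlanta Hawks"],
    "Position": ["PG", "SG", "SG", "SF", "C"],
    "Age": [25.0, 27.0, 22.0, 26.0, 30.0],
})

by_team_pos = toy.groupby(["Team", "Position"])

# Rows per (Team, Position) pair; the result is indexed by both columns
sizes = by_team_pos.size()
print(sizes)

# With several grouping columns, the group key is a tuple
celtics_sg = by_team_pos.get_group(("Boston Celtics", "SG"))
print(celtics_sg)
```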


## Aggregation

Aggregation can be performed either by calling one of the available aggregating functions on the `groupby` object, or by using the aggregate function to apply arbitrary function logic.
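
Both routes can be sketched on a `groupby` object as follows. The toy frame and the "age range" function are hypothetical, chosen only to show a built-in aggregation next to a custom one:

```python
import pandas as pd

# Hypothetical toy frame; the rows are copied from outputs shown in this notebook
toy = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "R.J. Hunter",
             "Kent Bazemore", "Al Horford"],
    "Team": ["Boston Celtics", "Boston Celtics", "Boston Celtics",
             "Atlanta Hawks", "Atlanta Hawks"],
    "Age": [25.0, 25.0, 22.0, 26.0, 30.0],
})

by_team = toy.groupby("Team")

# Route 1: a built-in aggregating function on the grouped column
mean_age = by_team["Age"].mean()
print(mean_age)

# Route 2: arbitrary logic passed to agg(); here, oldest minus youngest
age_range = by_team["Age"].agg(lambda ages: ages.max() - ages.min())
print(age_range)
```

Each result has one entry per team, because the function is applied to every group separately.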

The `DataFrame.aggregate()` method applies one or more aggregations across one or more columns. It accepts a callable, a string, a dict, or a list of strings/callables. The most frequently used aggregations are:

- sum: Return the sum of the values for the requested axis
- min: Return the minimum of the values for the requested axis
- max: Return the maximum of the values for the requested axis

In [6]:
# Apply aggregations across all the columns: sum and max will be found for each
# numeric column in the df dataframe
  
df.aggregate(["sum", "max"])

Unnamed: 0,Number,Age,Weight,Salary
sum,8079.0,12311.0,101236.0,2159837000.0
max,99.0,40.0,307.0,25000000.0
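
How `aggregate` treats non-numeric columns such as `Name` or `College` can differ between pandas versions. One way to avoid depending on that behavior is to select the numeric columns explicitly before aggregating. The following is a sketch under that assumption, using a hypothetical toy frame rather than the full dataset:

```python
import pandas as pd

# Hypothetical toy frame; the rows are copied from outputs shown in this notebook
toy = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "R.J. Hunter",
             "Kent Bazemore", "Al Horford"],
    "Age": [25.0, 25.0, 22.0, 26.0, 30.0],
    "Salary": [7730337.0, 6796117.0, 1148640.0, 2000000.0, 12000000.0],
})

# Keep only the numeric columns, then apply both aggregations to each of them
numeric = toy.select_dtypes(include="number")
summary = numeric.aggregate(["sum", "max"])
print(summary)
```

The result has one row per aggregation and one column per numeric column, matching the layout of the output above.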


In pandas, we can also apply different aggregation functions to different columns. To do so, we pass a dictionary whose keys are the column names and whose values are the lists of aggregation functions to apply to each column.

In [7]:
# Compute a different set of aggregations for each of these columns
df.aggregate({ "Age":['max', 'min'], 
              "Weight":['max', 'min'],  
              "Salary":['sum']}) 

Unnamed: 0,Age,Weight,Salary
max,40.0,307.0,
min,19.0,161.0,
sum,,,2159837000.0
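
In the output above, a cell is empty (NaN) wherever a function was not requested for that column. The same dictionary can also be passed to `aggregate` on a `groupby` object to get the statistics per group. Below is a minimal sketch on a hypothetical toy frame (`toy`, `spec`, and the chosen rows are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame; the rows are copied from outputs shown in this notebook
toy = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "R.J. Hunter",
             "Kent Bazemore", "Al Horford"],
    "Team": ["Boston Celtics", "Boston Celtics", "Boston Celtics",
             "Atlanta Hawks", "Atlanta Hawks"],
    "Age": [25.0, 25.0, 22.0, 26.0, 30.0],
    "Salary": [7730337.0, 6796117.0, 1148640.0, 2000000.0, 12000000.0],
})

# Column name -> list of aggregations to apply to that column
spec = {"Age": ["max", "min"], "Salary": ["sum"]}

# Whole frame: rows are the functions, NaN where a function was not requested
overall = toy.aggregate(spec)
print(overall)

# Per team: columns become (column, function) pairs
per_team = toy.groupby("Team").aggregate(spec)
print(per_team)
```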


### Summary

- We can split the data in a dataframe into groups by using the `groupby()` function
    - We can group data by one or multiple columns
- We can perform aggregation by using the `aggregate` function to conveniently get the sum, min, max or other statistics about our dataframe.