In [6]:
import pandas as pd
path_data = '../../../assets/data/'
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import numpy as np
import warnings
warnings.filterwarnings('ignore')

nhl = pd.read_csv(path_data + 'nhl_salaries.csv')
nhl = nhl.drop(columns=['Rank'])

# Classifying by One Variable

Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups. For example, in the top movies dataset, it's common to ask:

**What was the total gross (earnings) in each year?** 

This type of question involves classifying individuals into categories that are not numerical. To classify individuals into categories we use **grouping**. 

## Counting the Number in Each Category
Calling `groupby()` with a single column name splits the rows according to the categories in that column. Following it with `count()` then counts the number of rows in each category. The result contains one row per unique value in the grouped column.

Here is a small dataframe of data on ice cream cones. The `groupby()` method can be used to list the distinct flavors and provide the counts of each flavor.

In [18]:
cones = pd.DataFrame(
    {'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'],
    'Price': [3.55, 4.75, 6.55, 5.25, 5.25]}
)
cones

Unnamed: 0,Flavor,Price
0,strawberry,3.55
1,chocolate,4.75
2,chocolate,6.55
3,strawberry,5.25
4,chocolate,5.25


In [8]:
cones.groupby('Flavor')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x103a15eb0>

`.groupby()` returns a `DataFrameGroupBy` object, which does not display the data directly. It has split the data into separate groups. You can see the groups by applying the `display` function to this `groupby` object. (If you are not familiar with `apply()`, don't worry; we will cover it in detail later in this chapter.)

In [9]:
cones.groupby('Flavor').apply(display)

Unnamed: 0,Flavor,Price
1,chocolate,4.75
2,chocolate,6.55
4,chocolate,5.25


Unnamed: 0,Flavor,Price
0,strawberry,3.55
3,strawberry,5.25
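
Another way to inspect the groups, shown here as a minimal sketch, is to loop over the `groupby` object. Each iteration yields a pair: the group label and the rows belonging to that group. The dictionary `group_sizes` is a name introduced only for this illustration.

```python
import pandas as pd

cones = pd.DataFrame(
    {'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'],
     'Price': [3.55, 4.75, 6.55, 5.25, 5.25]}
)

# Iterating over a groupby object yields (group label, sub-dataframe) pairs.
group_sizes = {}  # hypothetical helper, used only to record what we saw
for flavor, rows in cones.groupby('Flavor'):
    print(flavor)
    print(rows)
    group_sizes[flavor] = len(rows)
```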


You may also want to view summary statistics for each group. Calling `describe()` on the `groupby` object provides them:

In [10]:
cones.groupby('Flavor').describe()

Unnamed: 0_level_0,Price,Price,Price,Price,Price,Price,Price,Price
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
Flavor,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
chocolate,3.0,5.516667,0.929157,4.75,5.0,5.25,5.9,6.55
strawberry,2.0,4.4,1.202082,3.55,3.975,4.4,4.825,5.25


Although `groupby()` splits the dataframe into groups, it does not summarize the non-grouping columns until told how to do so, so we need an aggregation function. `pandas` provides a range of built-in aggregations, including `sum()`, `mean()`, `count()`, `max()`, and `min()`. We can also call `.agg()` on the `groupby` object to apply an aggregation function that we define ourselves, but we will limit ourselves to the built-in aggregations.
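
For readers curious about `.agg()`, here is a small sketch of how it might be used on the cones data. The function `price_range` is a hypothetical custom aggregation written for this illustration; it is not part of `pandas`.

```python
import pandas as pd

cones = pd.DataFrame(
    {'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'],
     'Price': [3.55, 4.75, 6.55, 5.25, 5.25]}
)

# Hypothetical custom aggregation: gap between the most and least expensive cone.
def price_range(prices):
    return prices.max() - prices.min()

# Mix built-in aggregations (named by string) with our own function.
summary = cones.groupby('Flavor')['Price'].agg(['count', 'mean', price_range])
print(summary)
```

Each entry in the list becomes one column of the result, with one row per flavor.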

Let's first count the number of rows in each category:

In [11]:
cones.groupby('Flavor').count()

Unnamed: 0_level_0,Price
Flavor,Unnamed: 1_level_1
chocolate,3
strawberry,2


Notice that this can all be worked out from just the `Flavor` column. The `Price` column has not been used.
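
To illustrate that point, the sketch below computes the same counts without touching `Price` at all. It relies on two standard `pandas` methods: `value_counts()` on a single column, and `size()` on a `groupby` object. One difference worth knowing is that `count()` tallies the non-missing entries in each remaining column, while `size()` counts rows.

```python
import pandas as pd

cones = pd.DataFrame(
    {'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'],
     'Price': [3.55, 4.75, 6.55, 5.25, 5.25]}
)

# Count the rows per flavor using only the Flavor column.
counts_from_flavor = cones['Flavor'].value_counts()

# size() counts rows per group, regardless of the other columns.
group_sizes = cones.groupby('Flavor').size()

print(counts_from_flavor)
print(group_sizes)
```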

But what if we wanted the total price of the cones of each flavor? We can call another aggregate function, `sum()`, to get this information. Note that the column used for grouping becomes the index of the grouped table. If you don't want it as the index, pass the optional argument `as_index=False` to `groupby()`.

In [12]:
cones.groupby('Flavor').sum()

Unnamed: 0_level_0,Price
Flavor,Unnamed: 1_level_1
chocolate,16.55
strawberry,8.8
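
As a brief sketch of the `as_index` option, the same sum can be computed with `Flavor` kept as an ordinary column. The variable name `totals` is introduced only for this example.

```python
import pandas as pd

cones = pd.DataFrame(
    {'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'],
     'Price': [3.55, 4.75, 6.55, 5.25, 5.25]}
)

# as_index=False keeps Flavor as a regular column; rows get a default 0, 1, ... index.
totals = cones.groupby('Flavor', as_index=False).sum()
print(totals)
```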


**Note:** Choosing the appropriate aggregate function is crucial for data analysis. The right choice often depends on the question being asked and on the data type of the column.

## Example: NHL Salaries

The dataframe `nhl` contains data on the 2023-2024 players in the National Hockey League. We will answer some data science questions about this dataset using grouping.

In [13]:
nhl

Unnamed: 0,Name,Team,Position,Salary
0,Nathan MacKinnon,COL,C,12600000
1,Connor McDavid,EDM,C,12500000
2,Artemi Panarin,NYR,LW,11642857
3,Auston Matthews,TOR,C,11640250
4,Erik Karlsson,PIT,D,11500000
...,...,...,...,...
796,Stefan Noesen,CAR,RW,762500
797,Vincent Desharnais,EDM,D,762500
798,Boris Katchouk,OTT,LW,758333
799,Taylor Raddysh,CHI,RW,758333


**1.** How much money did each team pay for its players' salaries?

The only columns involved are `Team` and `Salary`. We have to group the rows by `Team` and then `sum()` the salaries of the groups. 

In [14]:
teams_and_money = nhl.loc[:,['Team', 'Salary']]
teams_and_money.groupby('Team').sum()

Unnamed: 0_level_0,Salary
Team,Unnamed: 1_level_1
ANA,68864167
ARI,65684872
BOS,85920000
BUF,71523570
CAR,95409417
CBJ,75754166
CGY,72827500
CHI,58468333
COL,88550000
DAL,87196667
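
An alternative to selecting the columns first is to group the whole table and then pick out `Salary` before summing. The sketch below assumes a nine-row sample, `nhl_sample`, copied from the rows of `nhl` displayed earlier; because the sample is so small, its totals are not the real team totals.

```python
import pandas as pd

# Hypothetical nine-row sample built from the rows of `nhl` shown above.
nhl_sample = pd.DataFrame({
    'Name': ['Nathan MacKinnon', 'Connor McDavid', 'Artemi Panarin',
             'Auston Matthews', 'Erik Karlsson', 'Stefan Noesen',
             'Vincent Desharnais', 'Boris Katchouk', 'Taylor Raddysh'],
    'Team': ['COL', 'EDM', 'NYR', 'TOR', 'PIT', 'CAR', 'EDM', 'OTT', 'CHI'],
    'Position': ['C', 'C', 'LW', 'C', 'D', 'RW', 'D', 'LW', 'RW'],
    'Salary': [12600000, 12500000, 11642857, 11640250, 11500000,
               762500, 762500, 758333, 758333],
})

# Group by Team, select the Salary column, then sum within each team.
team_totals = nhl_sample.groupby('Team')['Salary'].sum()

# Sort so the largest payroll (within this sample) comes first.
largest_first = team_totals.sort_values(ascending=False)
print(largest_first)
```

Selecting a single column after grouping returns a Series indexed by team rather than a one-column dataframe.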


**2.** How many NHL players were there in each of the five positions?

We just need the `Name` and `Position` columns to count the number of players in each position. Then we have to classify by `Position`, and count.

In [15]:
nhl[['Name','Position']].groupby('Position').count()

Unnamed: 0_level_0,Name
Position,Unnamed: 1_level_1
C,240
D,260
G,73
LW,130
RW,98


**3.** What was the average salary of the players at each of the five positions?

This time, we have to group by `Position` and take the mean of the salaries. For clarity, we will work with a table of just the positions and the salaries.

In [16]:
positions_and_money = nhl[['Position', 'Salary']]
positions_and_money.groupby('Position').mean()

Unnamed: 0_level_0,Salary
Position,Unnamed: 1_level_1
C,3419478.0
D,3199451.0
G,3018196.0
LW,3298865.0
RW,3027276.0


Center was the most highly paid position, at an average of about 3.4 million dollars.

If we had not selected the two columns as our first step, `mean()` would attempt to average the categorical columns in `nhl` as well and would generate an error. (It is impossible to average two strings like "Nathan MacKinnon" and "Connor McDavid".) 

In [17]:
try:
    nhl.groupby('Position').mean()
except TypeError as err:
    print(err)

agg function failed [how->mean,dtype->object]
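
One way to avoid this error without selecting columns beforehand, sketched here under the same assumptions as the rest of this section, is to pass `numeric_only=True` to `mean()`, which skips the non-numeric columns. The nine-row `nhl_sample` below is a hypothetical sample copied from the rows of `nhl` displayed earlier, so its averages differ from those of the full table.

```python
import pandas as pd

# Hypothetical nine-row sample built from the rows of `nhl` shown above.
nhl_sample = pd.DataFrame({
    'Name': ['Nathan MacKinnon', 'Connor McDavid', 'Artemi Panarin',
             'Auston Matthews', 'Erik Karlsson', 'Stefan Noesen',
             'Vincent Desharnais', 'Boris Katchouk', 'Taylor Raddysh'],
    'Team': ['COL', 'EDM', 'NYR', 'TOR', 'PIT', 'CAR', 'EDM', 'OTT', 'CHI'],
    'Position': ['C', 'C', 'LW', 'C', 'D', 'RW', 'D', 'LW', 'RW'],
    'Salary': [12600000, 12500000, 11642857, 11640250, 11500000,
               762500, 762500, 758333, 758333],
})

# numeric_only=True tells mean() to ignore string columns such as Name and Team.
position_means = nhl_sample.groupby('Position').mean(numeric_only=True)
print(position_means)
```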
