In [10]:
from datascience import *
path_data = '../../../assets/data/'
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np

nhl = Table.read_table(path_data + 'nhl_salaries.csv')
nhl = nhl.drop('Rank')

# Classifying by One Variable

Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups. For example, in the top movies dataset, it's common to ask:

**What was the total gross in each year?** 

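This type of question involves classifying individuals into categories. The categories need not be numerical, and even when they are (as with years), the values serve as labels rather than quantities. To classify individuals into categories, we use **grouping**.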

## Counting the Number in Each Category
The `group` method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.

Here is a small table of data on ice cream cones. The `group` method can be used to list the distinct flavors and provide the counts of each flavor.

In [11]:
cones = Table().with_columns(
    'Flavor', make_array('strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'),
    'Price', make_array(3.55, 4.75, 6.55, 5.25, 5.25)
)
cones

Flavor,Price
strawberry,3.55
chocolate,4.75
chocolate,6.55
strawberry,5.25
chocolate,5.25


In [12]:
cones.group('Flavor')

Flavor,count
chocolate,3
strawberry,2


There are two distinct categories, chocolate and strawberry. The call to `group` creates a table of counts in each category. The column is called `count` by default, and contains the number of rows in each category.

Notice that this can all be worked out from just the `Flavor` column. The `Price` column has not been used.
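
To make the counting step concrete, here is a minimal sketch in plain Python that works from the flavors alone. It uses `collections.Counter` from the standard library rather than the `datascience` package, so it illustrates the idea and is not how `group` is implemented; the list `flavors` is simply the `Flavor` column copied by hand.

```python
from collections import Counter

# Flavor values copied from the `cones` table above
flavors = ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate']

# Count how many rows fall in each category
counts = Counter(flavors)

# Print the categories in sorted order, matching the grouped table
for flavor in sorted(counts):
    print(flavor, counts[flavor])
```

The prices never appear in this code, which is the point: counting needs only the column of categories.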

But what if we wanted the total price of the cones of each different flavor? That's where the second argument of `group` comes in.

## Finding a Characteristic of Each Category
The optional second argument of `group` names the function used to aggregate the values in the other columns across all the rows in each category. For instance, `sum` adds up the prices in all rows that belong to each category. The result still contains one row per unique value in the grouped column, but it has the same number of columns as the original table.

To find the total price of each flavor, we call `group` again, with `Flavor` as its first argument as before. But this time there is a second argument: the function name `sum`.

In [13]:
cones.group('Flavor', sum)

Flavor,Price sum
chocolate,16.55
strawberry,8.8


To create this new table, `group` has calculated the sum of the `Price` entries in all the rows corresponding to each distinct flavor. The prices in the three `chocolate` rows add up to $\$16.55$ (you can assume that price is being measured in dollars). The prices in the two `strawberry` rows have a total of $\$8.80$.

The label of the newly created "sum" column is `Price sum`, which is created by taking the label of the column being summed, and appending the word `sum`. 

Because `group` finds the `sum` of all columns other than the one with the categories, there is no need to specify that it has to `sum` the prices.
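
The behavior described here can be sketched in plain Python. The helper `group_by` below is hypothetical and is not the `datascience` implementation; it assumes each row is a dictionary, applies the given function to every column other than the grouping column, and builds each new label by appending the function's name to the column label.

```python
def group_by(rows, key, collect):
    """Hypothetical helper: group a list of dict rows by `key` and
    apply `collect` to the values of every other column."""
    # Gather the rows that belong to each category
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)

    result = []
    for category in sorted(groups):
        out = {key: category}
        for label in rows[0]:
            if label != key:
                values = [r[label] for r in groups[category]]
                # New label: original label plus the function's name
                out[label + ' ' + collect.__name__] = collect(values)
        result.append(out)
    return result

# Rows copied from the `cones` table
cones_rows = [
    {'Flavor': 'strawberry', 'Price': 3.55},
    {'Flavor': 'chocolate', 'Price': 4.75},
    {'Flavor': 'chocolate', 'Price': 6.55},
    {'Flavor': 'strawberry', 'Price': 5.25},
    {'Flavor': 'chocolate', 'Price': 5.25},
]

totals = group_by(cones_rows, 'Flavor', sum)
for row in totals:
    print(row)
```

Because the helper loops over every label except the grouping one, the caller never has to say which column to sum.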

To see in more detail what `group` is doing, notice that you could have figured out the total prices yourself, not only by mental arithmetic but also using code. For example, to find the total price of all the chocolate cones, you could start by creating a new table consisting of only the chocolate cones, and then accessing the column of prices:

In [14]:
cones.where('Flavor', are.equal_to('chocolate')).column('Price')

array([ 4.75,  6.55,  5.25])

In [15]:
sum(cones.where('Flavor', are.equal_to('chocolate')).column('Price'))

16.550000000000001

This is what `group` is doing for each distinct value in `Flavor`.

In [16]:
# For each distinct value in `Flavor`, access all the rows
# and create an array of `Price`

cones_choc = cones.where('Flavor', are.equal_to('chocolate')).column('Price')
cones_strawb = cones.where('Flavor', are.equal_to('strawberry')).column('Price')

# Display the arrays in a table

grouped_cones = Table().with_columns(
    'Flavor', make_array('chocolate', 'strawberry'),
    'Array of All the Prices', make_array(cones_choc, cones_strawb)
)

# Append a column with the sum of the `Price` values in each array

price_totals = grouped_cones.with_column(
    'Sum of the Array', make_array(sum(cones_choc), sum(cones_strawb))
)
price_totals

Flavor,Array of All the Prices,Sum of the Array
chocolate,[ 4.75 6.55 5.25],16.55
strawberry,[ 3.55 5.25],8.8


You can replace `sum` with any other function that works on arrays. For example, you could use `max` to find the largest price in each category:

In [17]:
cones.group('Flavor', max)

Flavor,Price max
chocolate,6.55
strawberry,5.25


Once again, `group` creates arrays of the prices in each `Flavor` category. But now it finds the `max` of each array:

In [18]:
price_maxes = grouped_cones.with_column(
    'Max of the Array', make_array(max(cones_choc), max(cones_strawb))
)
price_maxes

Flavor,Array of All the Prices,Max of the Array
chocolate,[ 4.75 6.55 5.25],6.55
strawberry,[ 3.55 5.25],5.25


Indeed, the original call to `group` with just one argument has the same effect as using `len` as the function and then cleaning up the table.

In [19]:
lengths = grouped_cones.with_column(
    'Length of the Array', make_array(len(cones_choc), len(cones_strawb))
)
lengths

Flavor,Array of All the Prices,Length of the Array
chocolate,[ 4.75 6.55 5.25],3
strawberry,[ 3.55 5.25],2
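
The same pattern extends to a function you write yourself, as long as it takes a collection of values and returns a single value. The sketch below uses plain Python lists in place of the arrays above; `price_range` is a hypothetical function introduced only for illustration.

```python
def price_range(prices):
    """Hypothetical aggregate: gap between the highest and lowest price."""
    return max(prices) - min(prices)

# Prices per flavor, copied from the arrays shown above
prices_by_flavor = {
    'chocolate': [4.75, 6.55, 5.25],
    'strawberry': [3.55, 5.25],
}

# Apply the function to each group's values, one result per category
ranges = {flavor: round(price_range(prices), 2)
          for flavor, prices in prices_by_flavor.items()}
print(ranges)
```

Passing such a function as the second argument of `group` would be expected to work the same way as `sum` or `max`, though that call is an assumption here and is not shown in the examples above.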


## Example: NHL Salaries
The table `nhl` contains data on the players in the National Hockey League during the 2023-2024 season. We will answer some data science questions about this dataset using grouping.

In [20]:
nhl

Name,Team,Position,Salary
Nathan MacKinnon,COL,C,12600000
Connor McDavid,EDM,C,12500000
Artemi Panarin,NYR,LW,11642857
Auston Matthews,TOR,C,11640250
Erik Karlsson,PIT,D,11500000
David Pastrnak,BOS,RW,11250000
Drew Doughty,LAK,D,11000000
John Tavares,TOR,C,11000000
Mitchell Marner,TOR,RW,10903000
Carey Price,MTL,G,10500000


**1.** How much money did each team pay for its players' salaries?

The only columns involved are `Team` and `Salary`. We have to `group` the rows by `Team` and then `sum` the salaries of the groups. 

In [21]:
teams_and_money = nhl.select('Team', 'Salary')
teams_and_money.group('Team', sum)

Team,Salary sum
ANA,68864167
ARI,65684872
BOS,85920000
BUF,71523570
CAR,95409417
CBJ,75754166
CGY,72827500
CHI,58468333
COL,88550000
DAL,87196667


**2.** How many NHL players were there in each of the five positions?

We have to classify by `Position`, and count. This can be done with just one argument to `group`:

In [22]:
nhl.group('Position')

Position,count
C,240
D,260
G,73
LW,130
RW,98


**3.** What was the average salary of the players at each of the five positions?

This time, we have to group by `Position` and take the mean of the salaries. For clarity, we will work with a table of just the positions and the salaries.

In [23]:
positions_and_money = nhl.select('Position', 'Salary')
positions_and_money.group('Position', np.mean)

Position,Salary mean
C,3419480.0
D,3199450.0
G,3018200.0
LW,3298870.0
RW,3027280.0


Center was the most highly paid position, at an average of about 3.4 million dollars. Every position averaged over 3 million dollars.
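
As a quick check of that comparison, the averages can be ranked in plain Python. The dictionary below copies the `Salary mean` values from the table above by hand; the variable names are illustrative only.

```python
# Average salaries copied from the grouped table above
salary_mean = {
    'C': 3419480.0,
    'D': 3199450.0,
    'G': 3018200.0,
    'LW': 3298870.0,
    'RW': 3027280.0,
}

# Positions with the highest and lowest average salary
top_position = max(salary_mean, key=salary_mean.get)
lowest_position = min(salary_mean, key=salary_mean.get)

print(top_position, salary_mean[top_position])
print(lowest_position, salary_mean[lowest_position])
```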

If we had not selected the two columns as our first step, `group` would still not attempt to "average" the categorical columns in `nhl`. (It is impossible to average two strings like "Nathan MacKinnon" and "Connor McDavid".) It performs arithmetic only on numerical columns and leaves the rest blank.

In [24]:
nhl.group('Position', np.mean)

Position,Name mean,Team mean,Salary mean
C,,,3419480.0
D,,,3199450.0
G,,,3018200.0
LW,,,3298870.0
RW,,,3027280.0
