![rmotr](https://user-images.githubusercontent.com/7065401/39119486-4718e386-46ec-11e8-9fc3-5250a49ef570.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/39119910-5f70eaa4-46ed-11e8-8236-b68568c39971.jpg"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>

# Creating Groups

If our data is continuous, there are no ready-made categories to group by, so we first need to create the groups ourselves. There are a few different mechanisms for doing so:

![separator2](https://user-images.githubusercontent.com/7065401/39119518-59fa51ce-46ec-11e8-8503-5f8136558f2b.png)

## Hands on!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
df = pd.read_csv('data/nba_small_demo.csv', index_col=0)
df

Player,Age,Pos,salary,season_end,season_start,team
LeBron James,32.0,SF,33285709,2018,2017,CLE
Paul Millsap,31.0,PF,31269231,2018,2017,DEN
Stephen Curry,28.0,PG,34682550,2018,2017,GSW
Kevin Durant,28.0,PF,25000000,2018,2017,GSW
Klay Thompson,26.0,SG,17826150,2018,2017,GSW
Blake Griffin,27.0,PF,29512900,2018,2017,LAC
Russell Westbrook,28.0,PG,28530608,2018,2017,OKC
Carmelo Anthony,32.0,SF,26243760,2018,2017,OKC
Kawhi Leonard,25.0,SF,18868625,2018,2017,SAS
Manu Ginobili,39.0,SG,2500000,2018,2017,SAS


![separator1](https://user-images.githubusercontent.com/7065401/39119545-6d73d9aa-46ec-11e8-98d3-40204614f000.png)

### `qcut` & `cut`

We've already seen `qcut` (quantile-based bins, each with a similar number of rows) and `cut` (equal-width bins over the value range), but as a reminder:

In [3]:
pd.qcut(df['salary'], 3, labels=['Low Salary', 'Mid Range', 'High Salary'])

Player
LeBron James         High Salary
Paul Millsap         High Salary
Stephen Curry        High Salary
Kevin Durant          Low Salary
Klay Thompson         Low Salary
Blake Griffin          Mid Range
Russell Westbrook      Mid Range
Carmelo Anthony        Mid Range
Kawhi Leonard         Low Salary
Manu Ginobili         Low Salary
Name: salary, dtype: category
Categories (3, object): [Low Salary < Mid Range < High Salary]

In [4]:
df['Salary Range'] = pd.qcut(df['salary'], 3, labels=['Low Salary', 'Mid Range', 'High Salary'])

In [5]:
df

Player,Age,Pos,salary,season_end,season_start,team,Salary Range
LeBron James,32.0,SF,33285709,2018,2017,CLE,High Salary
Paul Millsap,31.0,PF,31269231,2018,2017,DEN,High Salary
Stephen Curry,28.0,PG,34682550,2018,2017,GSW,High Salary
Kevin Durant,28.0,PF,25000000,2018,2017,GSW,Low Salary
Klay Thompson,26.0,SG,17826150,2018,2017,GSW,Low Salary
Blake Griffin,27.0,PF,29512900,2018,2017,LAC,Mid Range
Russell Westbrook,28.0,PG,28530608,2018,2017,OKC,Mid Range
Carmelo Anthony,32.0,SF,26243760,2018,2017,OKC,Mid Range
Kawhi Leonard,25.0,SF,18868625,2018,2017,SAS,Low Salary
Manu Ginobili,39.0,SG,2500000,2018,2017,SAS,Low Salary


In [6]:
df['Age'].groupby(df['Salary Range']).min()

Salary Range
Low Salary     25.0
Mid Range      27.0
High Salary    28.0
Name: Age, dtype: float64
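The cells above only exercise `qcut`. As a rough side-by-side sketch, the snippet below rebuilds the salary column by hand (the values are copied from the table above; the variable names `by_quantile` and `by_width` are illustrative and not part of the original notebook) and bins it with both functions:

```python
import pandas as pd

# Salaries copied from the demo table above, indexed by player name.
salaries = pd.Series(
    [33285709, 31269231, 34682550, 25000000, 17826150,
     29512900, 28530608, 26243760, 18868625, 2500000],
    index=pd.Index(
        ['LeBron James', 'Paul Millsap', 'Stephen Curry', 'Kevin Durant',
         'Klay Thompson', 'Blake Griffin', 'Russell Westbrook',
         'Carmelo Anthony', 'Kawhi Leonard', 'Manu Ginobili'],
        name='Player',
    ),
    name='salary',
)
labels = ['Low Salary', 'Mid Range', 'High Salary']

# qcut: bin edges are placed at quantiles, so bins hold a similar number of rows.
by_quantile = pd.qcut(salaries, 3, labels=labels)

# cut: the range between the min and max salary is split into 3 equal-width bins.
by_width = pd.cut(salaries, 3, labels=labels)

print(pd.DataFrame({'qcut': by_quantile, 'cut': by_width}))
```

With this data, `qcut` produces groups of similar size, whereas `cut` leaves only Manu Ginobili in the lowest bin, because his salary sits far below everyone else's.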

![separator1](https://user-images.githubusercontent.com/7065401/39119545-6d73d9aa-46ec-11e8-98d3-40204614f000.png)

### More flexibility with `apply`

Running a custom function over each row gives you a lot more flexibility when you need to create your groups. For example, let's divide our players into "older than 30 y/o or not":

In [7]:
def older_than_30(x):
    # x is a single row of the DataFrame (a Series), because we apply with axis=1
    if x['Age'] > 30:
        return 1
    else:
        return 0

In [8]:
df.apply(older_than_30, axis=1)

Player
LeBron James         1
Paul Millsap         1
Stephen Curry        0
Kevin Durant         0
Klay Thompson        0
Blake Griffin        0
Russell Westbrook    0
Carmelo Anthony      1
Kawhi Leonard        0
Manu Ginobili        1
dtype: int64
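For a rule this simple, a row-by-row `apply` is not strictly required. The sketch below, which assumes "older than 30" means a strict comparison and rebuilds the `Age` column by hand (the names `via_apply`, `via_comparison` and `age_group` are illustrative), shows a vectorized comparison and `np.where` giving the same grouping:

```python
import numpy as np
import pandas as pd

# Ages copied from the demo table above.
ages = pd.Series(
    [32.0, 31.0, 28.0, 28.0, 26.0, 27.0, 28.0, 32.0, 25.0, 39.0],
    index=pd.Index(
        ['LeBron James', 'Paul Millsap', 'Stephen Curry', 'Kevin Durant',
         'Klay Thompson', 'Blake Griffin', 'Russell Westbrook',
         'Carmelo Anthony', 'Kawhi Leonard', 'Manu Ginobili'],
        name='Player',
    ),
    name='Age',
)

# Element-by-element version, similar in spirit to the custom function above.
via_apply = ages.apply(lambda age: 1 if age > 30 else 0)

# Vectorized version: compare the whole column at once, then cast bool -> int.
via_comparison = (ages > 30).astype(int)

# np.where picks one of two labels per element, handy for readable group names.
age_group = pd.Series(
    np.where(ages > 30, 'Over 30', '30 or under'),
    index=ages.index,
)
```

`apply` remains the more flexible tool when the rule depends on several columns or on logic that is hard to express as a single comparison.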

In [9]:
df['Older than 30'] = df.apply(older_than_30, axis=1)
df

Player,Age,Pos,salary,season_end,season_start,team,Salary Range,Older than 30
LeBron James,32.0,SF,33285709,2018,2017,CLE,High Salary,1
Paul Millsap,31.0,PF,31269231,2018,2017,DEN,High Salary,1
Stephen Curry,28.0,PG,34682550,2018,2017,GSW,High Salary,0
Kevin Durant,28.0,PF,25000000,2018,2017,GSW,Low Salary,0
Klay Thompson,26.0,SG,17826150,2018,2017,GSW,Low Salary,0
Blake Griffin,27.0,PF,29512900,2018,2017,LAC,Mid Range,0
Russell Westbrook,28.0,PG,28530608,2018,2017,OKC,Mid Range,0
Carmelo Anthony,32.0,SF,26243760,2018,2017,OKC,Mid Range,1
Kawhi Leonard,25.0,SF,18868625,2018,2017,SAS,Low Salary,0
Manu Ginobili,39.0,SG,2500000,2018,2017,SAS,Low Salary,1


In [10]:
df.sort_values('Pos')

Player,Age,Pos,salary,season_end,season_start,team,Salary Range,Older than 30
Paul Millsap,31.0,PF,31269231,2018,2017,DEN,High Salary,1
Kevin Durant,28.0,PF,25000000,2018,2017,GSW,Low Salary,0
Blake Griffin,27.0,PF,29512900,2018,2017,LAC,Mid Range,0
Stephen Curry,28.0,PG,34682550,2018,2017,GSW,High Salary,0
Russell Westbrook,28.0,PG,28530608,2018,2017,OKC,Mid Range,0
LeBron James,32.0,SF,33285709,2018,2017,CLE,High Salary,1
Carmelo Anthony,32.0,SF,26243760,2018,2017,OKC,Mid Range,1
Kawhi Leonard,25.0,SF,18868625,2018,2017,SAS,Low Salary,0
Klay Thompson,26.0,SG,17826150,2018,2017,GSW,Low Salary,0
Manu Ginobili,39.0,SG,2500000,2018,2017,SAS,Low Salary,1


In [11]:
df['salary'].groupby([df['Pos'], df['Older than 30']]).max()

Pos  Older than 30
PF   0                29512900
     1                31269231
PG   0                34682550
SF   0                18868625
     1                33285709
SG   0                17826150
     1                 2500000
Name: salary, dtype: int64
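Grouping by two keys can also be written with column names instead of passing Series objects. The following sketch rebuilds a reduced version of the DataFrame (only the columns needed here, with values copied from the tables above) and uses `unstack` to pivot the result into a small table; the names `by_series`, `by_names` and `wide` are illustrative:

```python
import pandas as pd

# Reduced copy of the demo data: only the columns this example needs.
df = pd.DataFrame(
    {
        'Age': [32.0, 31.0, 28.0, 28.0, 26.0, 27.0, 28.0, 32.0, 25.0, 39.0],
        'Pos': ['SF', 'PF', 'PG', 'PF', 'SG', 'PF', 'PG', 'SF', 'SF', 'SG'],
        'salary': [33285709, 31269231, 34682550, 25000000, 17826150,
                   29512900, 28530608, 26243760, 18868625, 2500000],
    },
    index=pd.Index(
        ['LeBron James', 'Paul Millsap', 'Stephen Curry', 'Kevin Durant',
         'Klay Thompson', 'Blake Griffin', 'Russell Westbrook',
         'Carmelo Anthony', 'Kawhi Leonard', 'Manu Ginobili'],
        name='Player',
    ),
)
df['Older than 30'] = (df['Age'] > 30).astype(int)

# Same grouping as in the notebook cell, passing the key Series explicitly.
by_series = df['salary'].groupby([df['Pos'], df['Older than 30']]).max()

# Equivalent spelling using column names on the DataFrame.
by_names = df.groupby(['Pos', 'Older than 30'])['salary'].max()

# Pivot the inner key into columns; combinations with no players become NaN.
wide = by_names.unstack()
print(wide)
```

Note that there is no point guard over 30 in this sample, so that combination is simply absent from the grouped result and appears as `NaN` after `unstack`.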

![separator1](https://user-images.githubusercontent.com/7065401/39119545-6d73d9aa-46ec-11e8-98d3-40204614f000.png)

### Transform

Sometimes you need to combine "group-wise" operations with "element-wise" operations, usually when you need to compare an individual with some property of the group it belongs to. For example, let's analyze each player's salary with respect to their position: how much does a player make compared to the highest-paid player **in his position**? Let's start with all the positions and the max salary in each one of them:

In [12]:
df['salary'].groupby(df['Pos']).max().to_frame().sort_index()

Pos,salary
PF,31269231
PG,34682550
SF,33285709
SG,17826150


In [13]:
df.sort_values('Pos')

Player,Age,Pos,salary,season_end,season_start,team,Salary Range,Older than 30
Paul Millsap,31.0,PF,31269231,2018,2017,DEN,High Salary,1
Kevin Durant,28.0,PF,25000000,2018,2017,GSW,Low Salary,0
Blake Griffin,27.0,PF,29512900,2018,2017,LAC,Mid Range,0
Stephen Curry,28.0,PG,34682550,2018,2017,GSW,High Salary,0
Russell Westbrook,28.0,PG,28530608,2018,2017,OKC,Mid Range,0
LeBron James,32.0,SF,33285709,2018,2017,CLE,High Salary,1
Carmelo Anthony,32.0,SF,26243760,2018,2017,OKC,Mid Range,1
Kawhi Leonard,25.0,SF,18868625,2018,2017,SAS,Low Salary,0
Klay Thompson,26.0,SG,17826150,2018,2017,GSW,Low Salary,0
Manu Ginobili,39.0,SG,2500000,2018,2017,SAS,Low Salary,1


For example, the highest salary in the position `PF` is `$31,269,231` ("Paul Millsap"). Let's subtract that maximum from the salary of every player **in the same position**:

In [14]:
df.loc[df['Pos'] == 'PF', 'salary'] - 31269231

Player
Paul Millsap           0
Kevin Durant    -6269231
Blake Griffin   -1756331
Name: salary, dtype: int64

Again, we're comparing a single individual with their own group; here, with the highest salary in that individual's position. But we've hardcoded the max salary of the `PF` position. How can we do it dynamically? The answer is `transform`:

In [15]:
df['salary'].groupby(df['Pos']).transform('max')

Player
LeBron James         33285709
Paul Millsap         31269231
Stephen Curry        34682550
Kevin Durant         31269231
Klay Thompson        17826150
Blake Griffin        31269231
Russell Westbrook    34682550
Carmelo Anthony      33285709
Kawhi Leonard        33285709
Manu Ginobili        17826150
Name: salary, dtype: int64

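`transform` takes a function (or the name of one, such as `'max'`) to apply to each group, but it "broadcasts" the result to ALL the individuals of that group, so the output has one row per player rather than one row per group. Maybe it'll look cleaner if we add it as another column in the original DataFrame: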

In [16]:
df['Max salary by Position'] = df['salary'].groupby(df['Pos']).transform('max')

In [17]:
df.sort_values('Pos')

Player,Age,Pos,salary,season_end,season_start,team,Salary Range,Older than 30,Max salary by Position
Paul Millsap,31.0,PF,31269231,2018,2017,DEN,High Salary,1,31269231
Kevin Durant,28.0,PF,25000000,2018,2017,GSW,Low Salary,0,31269231
Blake Griffin,27.0,PF,29512900,2018,2017,LAC,Mid Range,0,31269231
Stephen Curry,28.0,PG,34682550,2018,2017,GSW,High Salary,0,34682550
Russell Westbrook,28.0,PG,28530608,2018,2017,OKC,Mid Range,0,34682550
LeBron James,32.0,SF,33285709,2018,2017,CLE,High Salary,1,33285709
Carmelo Anthony,32.0,SF,26243760,2018,2017,OKC,Mid Range,1,33285709
Kawhi Leonard,25.0,SF,18868625,2018,2017,SAS,Low Salary,0,33285709
Klay Thompson,26.0,SG,17826150,2018,2017,GSW,Low Salary,0,17826150
Manu Ginobili,39.0,SG,2500000,2018,2017,SAS,Low Salary,1,17826150


As you can see, all the players in position `PF` have the same value under `Max salary by Position`, the max salary found in that position, which is, as we previously saw: `$31,269,231` ("Paul Millsap").
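To make the "broadcast" behaviour concrete, here is a small sketch contrasting a plain aggregation with `transform` on a reduced, hand-built copy of the data (values copied from the tables above; `max_by_pos`, `broadcast` and `via_map` are illustrative names). It also shows that, for this case, mapping each player's position onto the aggregated result gives the same column:

```python
import pandas as pd

# Reduced copy of the demo data: position and salary per player.
df = pd.DataFrame(
    {
        'Pos': ['SF', 'PF', 'PG', 'PF', 'SG', 'PF', 'PG', 'SF', 'SF', 'SG'],
        'salary': [33285709, 31269231, 34682550, 25000000, 17826150,
                   29512900, 28530608, 26243760, 18868625, 2500000],
    },
    index=pd.Index(
        ['LeBron James', 'Paul Millsap', 'Stephen Curry', 'Kevin Durant',
         'Klay Thompson', 'Blake Griffin', 'Russell Westbrook',
         'Carmelo Anthony', 'Kawhi Leonard', 'Manu Ginobili'],
        name='Player',
    ),
)

# Aggregation: one row per group (4 positions).
max_by_pos = df.groupby('Pos')['salary'].max()

# Transform: one row per original row (10 players), aligned to df's index.
broadcast = df.groupby('Pos')['salary'].transform('max')

# Manual equivalent: look up each player's position in the aggregated result.
via_map = df['Pos'].map(max_by_pos)

print(len(max_by_pos), len(broadcast))
```

Because the transformed result shares the original index, it can be assigned straight back as a new column or combined with other columns element by element.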

So now, we could just subtract the max of their group from each individual's salary:

In [18]:
df['salary'] - df['Max salary by Position']

Player
LeBron James                0
Paul Millsap                0
Stephen Curry               0
Kevin Durant         -6269231
Klay Thompson               0
Blake Griffin        -1756331
Russell Westbrook    -6151942
Carmelo Anthony      -7041949
Kawhi Leonard       -14417084
Manu Ginobili       -15326150
dtype: int64

We didn't need to store `'Max salary by Position'` in the DataFrame to achieve these results. We could have done it all in one line:

In [19]:
df['salary'] - df['salary'].groupby(df['Pos']).transform('max')

Player
LeBron James                0
Paul Millsap                0
Stephen Curry               0
Kevin Durant         -6269231
Klay Thompson               0
Blake Griffin        -1756331
Russell Westbrook    -6151942
Carmelo Anthony      -7041949
Kawhi Leonard       -14417084
Manu Ginobili       -15326150
Name: salary, dtype: int64
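The same pattern works for other element-versus-group comparisons. As a hedged variation (not part of the original notebook), the sketch below expresses each salary as a fraction of the top salary in the player's position, using a reduced, hand-built copy of the data; `share_of_max` is an illustrative name:

```python
import pandas as pd

# Reduced copy of the demo data: position and salary per player.
df = pd.DataFrame(
    {
        'Pos': ['SF', 'PF', 'PG', 'PF', 'SG', 'PF', 'PG', 'SF', 'SF', 'SG'],
        'salary': [33285709, 31269231, 34682550, 25000000, 17826150,
                   29512900, 28530608, 26243760, 18868625, 2500000],
    },
    index=pd.Index(
        ['LeBron James', 'Paul Millsap', 'Stephen Curry', 'Kevin Durant',
         'Klay Thompson', 'Blake Griffin', 'Russell Westbrook',
         'Carmelo Anthony', 'Kawhi Leonard', 'Manu Ginobili'],
        name='Player',
    ),
)

# Divide instead of subtract: 1.0 means "highest paid in the position".
share_of_max = df['salary'] / df.groupby('Pos')['salary'].transform('max')

print(share_of_max.round(2).sort_values(ascending=False))
```

A ratio is often easier to compare across positions than an absolute difference, since the top salaries themselves differ from one position to another.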

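To see what `transform` actually passes to a custom function, we can use a function that fails on purpose and shows its argument in the error message:

In [20]:
def my_f(arg):
    assert False, arg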

In [21]:
df['Max salary by Position'] = df['salary'].groupby(df['Pos']).transform(my_f)

AssertionError: Player
Paul Millsap     31269231
Kevin Durant     25000000
Blake Griffin    29512900
Name: PF, dtype: int64
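The error message shows that the function received the salaries of a single group (here `PF`) as a Series. A less disruptive way to inspect the groups, sketched below on a reduced, hand-built copy of the data, is to iterate over the groupby object; the same snippet then passes a custom function to `transform`. The names `seen`, `gap_to_group_max` and `gaps` are illustrative:

```python
import pandas as pd

# Reduced copy of the demo data: position and salary per player.
df = pd.DataFrame(
    {
        'Pos': ['SF', 'PF', 'PG', 'PF', 'SG', 'PF', 'PG', 'SF', 'SF', 'SG'],
        'salary': [33285709, 31269231, 34682550, 25000000, 17826150,
                   29512900, 28530608, 26243760, 18868625, 2500000],
    },
    index=pd.Index(
        ['LeBron James', 'Paul Millsap', 'Stephen Curry', 'Kevin Durant',
         'Klay Thompson', 'Blake Griffin', 'Russell Westbrook',
         'Carmelo Anthony', 'Kawhi Leonard', 'Manu Ginobili'],
        name='Player',
    ),
)

# Iterating over a groupby yields (group key, sub-Series) pairs.
seen = []
for pos, group in df['salary'].groupby(df['Pos']):
    seen.append((pos, list(group.index)))
    print(pos, list(group.index))


def gap_to_group_max(group):
    # `group` holds the salaries of one position; the returned Series keeps
    # the same index, so transform can align it with the original rows.
    return group - group.max()


gaps = df['salary'].groupby(df['Pos']).transform(gap_to_group_max)
```

Here the custom function computes the difference to the group maximum directly, so the result matches the one-line subtraction shown earlier.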

![separator2](https://user-images.githubusercontent.com/7065401/39119518-59fa51ce-46ec-11e8-8503-5f8136558f2b.png)