# Lecture 8

Group, Join

# Grouping

Classifying variables

In [None]:
#: imports!

import numpy as np
from datascience import *

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

## Grouping consists of Split - Apply - Combine

![grouping_workflow.png](./grouping_workflow.png)
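The three steps can be sketched in plain Python. This is only an illustration of the idea, not how the `datascience` library implements `.group()`; the function name `split_apply_combine` and the team/salary rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical (team, salary in $M) rows, made up for illustration.
rows = [
    ('Hawks', 4.0),
    ('Bulls', 2.5),
    ('Hawks', 1.0),
    ('Bulls', 3.5),
    ('Hawks', 2.0),
]

def split_apply_combine(rows, aggregator):
    # Split: collect the values that belong to each key.
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # Apply: reduce each group's values with the aggregator.
    # Combine: gather the results into one row per key, sorted by key.
    return [(key, aggregator(values)) for key, values in sorted(groups.items())]

counts = split_apply_combine(rows, len)  # group sizes
totals = split_apply_combine(rows, sum)  # group sums
print(counts)
print(totals)
```

Swapping the aggregator (`len`, `sum`, `max`, ...) changes only the apply step; the split and combine steps stay the same.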

## Our familiar NBA data...

In [None]:
#: read from csv and relabel
nba = Table.read_table('nba_salaries.csv').relabeled("'15-'16 SALARY", 'SALARY')
nba

## How big is each team?

- We know how to do this: `.group()`.
- Can visualize distribution of team sizes with `.hist()`.

In [None]:
nba.group('TEAM').hist('count')

## How much does each team pay in payroll?

- Instead of counting, we want to sum the `SALARY` column.

In [None]:
nba.group('TEAM', np.sum)

## What happened with the columns?

In [None]:
#: check it out...
nba.group('TEAM', np.sum)

- `np.sum` is applied to every column except `TEAM`.
- But not every column can be summed. E.g., `PLAYER` holds text.
- Those columns come out empty in the result.
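One way to picture the empty columns is an aggregator that gives up on values it cannot add. The sketch below is a plain-Python illustration under that assumption, not the library's actual code; `try_sum` and the sample values are hypothetical.

```python
def try_sum(values):
    # Assumption for illustration: if the values cannot be summed,
    # leave the cell blank instead of raising an error.
    try:
        return sum(values)
    except TypeError:
        return ''

# Hypothetical column values for one team.
players = ['Player A', 'Player B']   # text: cannot be summed
salaries = [18.5, 12.0]              # numbers: can be summed

print(repr(try_sum(players)))
print(try_sum(salaries))
```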

## Which position has the highest average salary?

- We need to group by position.
- Within each group, find the average.
- Then sort by average salary.

In [None]:
nba.group('POSITION', np.mean).sort('SALARY mean', descending=True)

## What is the max salary of each position?

- Group by position.
- Within each group, use `max`.

In [None]:
#: like so...
nba.group('POSITION', max)

## Discussion question

Does Zaza Pachulia make 22.3594 million per year?

1. Yes
2. No
3. I cannot tell from this table.

In [None]:
#: ...
nba.group('POSITION', max)
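When thinking about the question, it may help to see what happens when `max` is applied to each column separately. A minimal sketch with hypothetical players (the names, teams, and salaries are made up):

```python
# Hypothetical centers: (player, team, salary in $M).
centers = [
    ('Aaron Aardvark', 'Wolves', 20.0),
    ('Zed Zebra', 'Bears', 5.0),
]

# Take the max of each column on its own, one column at a time.
players, teams, salaries = zip(*centers)
max_row = (max(players), max(teams), max(salaries))

print(max_row)
print(max_row in centers)
```

In this sketch the combined row pairs the alphabetically last name with the largest salary, even though no single player in the made-up data has both.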

## How does `.group()` work?

In [None]:
#: prints what it is given
def our_aggregator(group):
    print('I was given', group)
    # try len, sum, list...
    return list(group)

In [None]:
#: select 6 random Cs and PFs
subset = nba.where('POSITION', are.contained_in(['C', 'PF'])).sample(6)
subset

In [None]:
subset.group('POSITION', our_aggregator)

## For each position, which team has the most players at that position?

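- We still want to count, but now the sizes of groups within groups.
- I.e., the number of players at each position within each team.
- Passing a list of column names to `.group()` makes one group per (team, position) pair.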

In [None]:
nba.group(['TEAM', 'POSITION']).sort('count', descending=True).sort('POSITION', distinct=True)
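The chain above counts pairs, sorts by count, then keeps one row per position. The same logic can be sketched in plain Python; the roster below is hypothetical, and this is an illustration rather than the library's implementation.

```python
from collections import Counter

# Hypothetical (team, position) pairs, one per player.
roster = [
    ('Hawks', 'C'), ('Hawks', 'C'), ('Hawks', 'PG'),
    ('Bulls', 'C'), ('Bulls', 'PG'), ('Bulls', 'PG'), ('Bulls', 'PG'),
]

# Count each (team, position) pair, like grouping by two columns.
pair_counts = Counter(roster)

# Sort pairs from largest to smallest count, then keep only the
# first pair seen for each position.
top_team = {}
by_count = sorted(pair_counts.items(), key=lambda item: item[1], reverse=True)
for (team, position), count in by_count:
    if position not in top_team:
        top_team[position] = (team, count)

print(top_team)
```

If two teams tie at a position, this sketch keeps whichever pair comes first after sorting, so only one of the tied teams is reported.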

## How many players are at each position on *every* team?

In [None]:
nba.group(['TEAM', 'POSITION'])

## A better approach: `.pivot()`

In [None]:
nba.pivot('POSITION', 'TEAM')

## `.pivot()` can do more than count...

- What is the *average* salary of each position on every team?

In [None]:
nba.pivot('POSITION', 'TEAM', 'SALARY', np.mean)
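A pivot can be thought of as a two-key grouping laid out as a grid: one row per team, one column per position. The plain-Python sketch below illustrates that idea; the `pivot` and `mean` helpers and the rows are hypothetical, and filling missing combinations with `0` is an assumption of the sketch.

```python
from collections import defaultdict

# Hypothetical (team, position, salary in $M) rows.
rows = [
    ('Hawks', 'C', 4.0), ('Hawks', 'C', 2.0), ('Hawks', 'PG', 1.0),
    ('Bulls', 'PG', 3.0),
]

def mean(values):
    return sum(values) / len(values)

def pivot(rows, aggregator, missing=0):
    # Collect the salaries that fall into each (team, position) cell.
    cells = defaultdict(list)
    for team, position, salary in rows:
        cells[(team, position)].append(salary)
    teams = sorted({team for team, _, _ in rows})
    positions = sorted({position for _, position, _ in rows})
    # One row per team, one column per position; cells with no
    # players get the `missing` value.
    return {
        team: {
            position: aggregator(cells[(team, position)])
            if (team, position) in cells else missing
            for position in positions
        }
        for team in teams
    }

counts = pivot(rows, len)   # like a pivot that only counts
means = pivot(rows, mean)   # like a pivot that averages salaries
print(counts)
print(means)
```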

# Join

Combining columns from two different tables

## Example

In [None]:
#: table of products
products = Table(['Location', 'Product', 'Price']).with_rows([
    ['Cups', 'Green Tea', 1.25],
    ['Cups', 'Latte', 2.50],
    ['Cups', 'Drip Coffee', 1.00],
    ['Art of Espresso', 'Espresso', 2.00],
    ['Art of Espresso', 'Latte', 3.00],
    ['Perks', 'Drip Coffee', 1.25],
    ['Perks', 'Green Tea', 1.50]
])
products

## Example

In [None]:
#: table of coupons
coupons = Table(['Location', 'Discount']).with_rows([
    ['Cups', .25],
    ['Art of Espresso', .10]
])
coupons

## How do we calculate the discounted price of each product?

- Idea: "cross-reference" tables.
- I.e., for each row in `products`, find discount in `coupons` for that row's `Location`.
- This is what `.join()` does:

In [None]:
discounted = products.join('Location', coupons)
discounted

In [None]:
discounted.with_column(
    'Discounted Price',
    np.round(discounted.column('Price') * (1 - discounted.column('Discount')), 2)
)

## The `.join()` method

- `this_table.join(common_column, that_table)`
- The result contains only rows whose `common_column` value appears in *both* tables.
    - For example, Perks was omitted because it has no row in `coupons`.
- What if the "common columns" have different names?
    - `this_table.join(this_column, that_table, that_column)`
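The matching rule can be sketched in plain Python using the same products and coupons. This is an illustration of the idea rather than the library's implementation; `inner_join` is a hypothetical helper that assumes the key is the first field of every row and that each key appears at most once in the right-hand table.

```python
products = [
    ('Cups', 'Green Tea', 1.25),
    ('Cups', 'Latte', 2.50),
    ('Cups', 'Drip Coffee', 1.00),
    ('Art of Espresso', 'Espresso', 2.00),
    ('Art of Espresso', 'Latte', 3.00),
    ('Perks', 'Drip Coffee', 1.25),
    ('Perks', 'Green Tea', 1.50),
]
coupons = [
    ('Cups', 0.25),
    ('Art of Espresso', 0.10),
]

def inner_join(left, right):
    # Index the right table by its key (the first field of each row).
    lookup = {key: rest for key, *rest in right}
    joined = []
    for key, *rest in left:
        # Keep a left row only if its key also appears on the right.
        if key in lookup:
            joined.append((key, *rest, *lookup[key]))
    return joined

joined = inner_join(products, coupons)
for location, product, price, discount in joined:
    print(location, product, round(price * (1 - discount), 2))
```

Rows for Perks never appear in `joined` because `coupons` has no matching key. The sketch keeps rows in the order of `products`; the order of the library's output may differ.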

## Example

In [None]:
cafes = coupons.relabeled('Location', 'Cafe')
cafes

In [None]:
products.join('Location', cafes, 'Cafe')