# Lecture 4 - Selecting Rows

### Announcements

- Lab 02 due tomorrow
- HW 02 out tonight.  Due Sunday.
- Project 01 out Friday (hopefully)
- No class on Monday due to holiday, but Lab 03 will release
- Waitlist issues should be sorted

# The datascience module
* see documentation here: http://data8.org/datascience/tables.html
* similar to Pandas dataframes, R data.frame, Spark dataframes

## Why not use a "real" library for tables?
* production libraries contain ugly warts and bloat.
* The datascience module highlights only what's important.
* Easy to transition to production libraries (do it, if you're motivated!).

# Anatomy of a table

In [3]:
#: the usual imports
from datascience import *
import numpy as np

In [4]:
# read './minard.csv'
minard = Table.read_table('minard.csv')
minard

Longitude,Latitude,City,Direction,Survivors
32.0,54.8,Smolensk,Advance,145000
33.2,54.9,Dorogobouge,Advance,140000
34.4,55.5,Chjat,Advance,127100
37.6,55.8,Moscou,Advance,100000
34.3,55.2,Wixma,Retreat,55000
32.0,54.6,Smolensk,Retreat,24000
30.4,54.4,Orscha,Retreat,20000
26.8,54.3,Moiodexno,Retreat,12000


### Shape of a table:
* number of columns,
* number of rows

In [5]:
minard.num_columns

5

In [6]:
minard.num_rows

8

### labels and relabeling columns
* `.labels` and `.relabeled(old_name, new_name)`
* `.relabeled` returns a new table (doesn't change the current one)

In [7]:
minard.labels

('Longitude', 'Latitude', 'City', 'Direction', 'Survivors')

In [10]:
minard = minard.relabeled('City', 'City Name')

### Selecting columns and table elements
* `.column()` takes a column name/index; returns an array.
* `.select()` takes column(s) (name/index); returns a table.

In [11]:
# access a column (array)
minard.column('City Name')

array(['Smolensk', 'Dorogobouge', 'Chjat', 'Moscou', 'Wixma', 'Smolensk',
       'Orscha', 'Moiodexno'], dtype='<U11')

In [12]:
# access an element of the table
minard.column('City Name').item(0)

'Smolensk'

In [13]:
minard.select('City Name', 'Latitude')

City Name,Latitude
Smolensk,54.8
Dorogobouge,54.9
Chjat,55.5
Moscou,55.8
Wixma,55.2
Smolensk,54.6
Orscha,54.4
Moiodexno,54.3


In [14]:
lat_long_cols = ['Latitude', 'Longitude']
minard.select(lat_long_cols)

Latitude,Longitude
54.8,32.0
54.9,33.2
55.5,34.4
55.8,37.6
55.2,34.3
54.6,32.0
54.4,30.4
54.3,26.8


### Adding columns: percentage of surviving troops at step k
* use `.with_column(col_name, array)`
* use `PercentFormatter` using `.set_format(col, formatter)`

In [15]:
initial = minard.column('Survivors').item(0)
minard = minard.with_column(
    'Percent Surviving', minard.column('Survivors')/initial
)
minard

Longitude,Latitude,City Name,Direction,Survivors,Percent Surviving
32.0,54.8,Smolensk,Advance,145000,1.0
33.2,54.9,Dorogobouge,Advance,140000,0.965517
34.4,55.5,Chjat,Advance,127100,0.876552
37.6,55.8,Moscou,Advance,100000,0.689655
34.3,55.2,Wixma,Retreat,55000,0.37931
32.0,54.6,Smolensk,Retreat,24000,0.165517
30.4,54.4,Orscha,Retreat,20000,0.137931
26.8,54.3,Moiodexno,Retreat,12000,0.0827586


In [16]:
# format the percent column
minard.set_format('Percent Surviving', PercentFormatter)

Longitude,Latitude,City Name,Direction,Survivors,Percent Surviving
32.0,54.8,Smolensk,Advance,145000,100.00%
33.2,54.9,Dorogobouge,Advance,140000,96.55%
34.4,55.5,Chjat,Advance,127100,87.66%
37.6,55.8,Moscou,Advance,100000,68.97%
34.3,55.2,Wixma,Retreat,55000,37.93%
32.0,54.6,Smolensk,Retreat,24000,16.55%
30.4,54.4,Orscha,Retreat,20000,13.79%
26.8,54.3,Moiodexno,Retreat,12000,8.28%
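
As a plain-Python check on the numbers above, the sketch below redoes the division and the percent formatting by hand. The list `survivors` copies the `Survivors` column; the names `fractions` and `percents` are introduced here for illustration, and the f-string formatting is only a stand-in for what `PercentFormatter` displays, not how the library implements it.

```python
# Survivors column, copied from the minard table above
survivors = [145000, 140000, 127100, 100000, 55000, 24000, 20000, 12000]

# the army's size at the first step
initial = survivors[0]

# fraction of the initial army remaining at each step
fractions = [s / initial for s in survivors]

# show each fraction as a percentage with two decimal places
percents = [f"{x:.2%}" for x in fractions]
print(percents)
```

Note that the formatting only changes how the values are shown; `fractions` still holds plain numbers between 0 and 1.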


### Dropping columns
* `.drop(cols)`

In [17]:
minard.drop(lat_long_cols)

City Name,Direction,Survivors,Percent Surviving
Smolensk,Advance,145000,100.00%
Dorogobouge,Advance,140000,96.55%
Chjat,Advance,127100,87.66%
Moscou,Advance,100000,68.97%
Wixma,Retreat,55000,37.93%
Smolensk,Retreat,24000,16.55%
Orscha,Retreat,20000,13.79%
Moiodexno,Retreat,12000,8.28%


In [18]:
minard

Longitude,Latitude,City Name,Direction,Survivors,Percent Surviving
32.0,54.8,Smolensk,Advance,145000,100.00%
33.2,54.9,Dorogobouge,Advance,140000,96.55%
34.4,55.5,Chjat,Advance,127100,87.66%
37.6,55.8,Moscou,Advance,100000,68.97%
34.3,55.2,Wixma,Retreat,55000,37.93%
32.0,54.6,Smolensk,Retreat,24000,16.55%
30.4,54.4,Orscha,Retreat,20000,13.79%
26.8,54.3,Moiodexno,Retreat,12000,8.28%


### Discussion Question

|How would you calculate the average of the numbers in the `Survivors` column of `minard`?|
|---|
|`A. sum(minard.select('Survivors')) / minard.num_rows`|
|`B. sum(minard.column('Survivors')) / minard.num_rows`|
|`C.                                Both A and B work.`|
|`D.                             Neither A nor B work.`|

In [22]:
sum(minard.column('Survivors')) / minard.num_rows

77887.5

In [21]:
minard.column('Survivors')

array([145000, 140000, 127100, 100000,  55000,  24000,  20000,  12000])
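
The difference between the two options comes down to what is being summed: `.column()` gives the numbers themselves, while `.select()` gives a table. The sketch below uses a plain dict of lists as a hypothetical stand-in for a one-column table, on the assumption that looping over a table, like looping over a dict, yields column labels rather than numbers. It is an analogy, not the library's behavior reproduced exactly.

```python
# Survivors column, copied from the minard table above
survivors = [145000, 140000, 127100, 100000, 55000, 24000, 20000, 12000]

# hypothetical stand-in for a one-column table: label -> values
table_like = {'Survivors': survivors}

# Option B: sum the values themselves, then divide by the number of rows
average = sum(survivors) / len(survivors)
print(average)

# Option A (by analogy): summing the container tries to add up its labels
try:
    sum(table_like)
except TypeError as err:
    print('TypeError:', err)
```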

# Summary of Table methods

Description|datascience module methods
---|---
Creating and extending tables:| `Table.read_table` and `Table().with_columns`
Finding the size| `num_rows` and `num_columns`
Referring to columns: labels, relabeling, and indices | `labels` and `relabeled`; column indices start at 0
Accessing data in a column|`column` takes a label or index and returns an array
Using array methods to work with data in columns|`item, sum, min, max`, and so on
Creating new tables containing some of the original columns:| `select, drop`

# Sorting Tables

* The `sort` method creates a new table with the same rows in a different order (the original table is unaffected).

* The `show(n)` method displays the first `n` rows of a table.

In [23]:
# Larger tables have rows omitted
# read in ./nba_salaries.csv
nba = Table.read_table('nba_salaries.csv')
nba

PLAYER,POSITION,TEAM,'15-'16 SALARY
Paul Millsap,PF,Atlanta Hawks,18.6717
Al Horford,C,Atlanta Hawks,12.0
Tiago Splitter,C,Atlanta Hawks,9.75625
Jeff Teague,PG,Atlanta Hawks,8.0
Kyle Korver,SG,Atlanta Hawks,5.74648
Thabo Sefolosha,SF,Atlanta Hawks,4.0
Mike Scott,PF,Atlanta Hawks,3.33333
Kent Bazemore,SF,Atlanta Hawks,2.0
Dennis Schroder,PG,Atlanta Hawks,1.7634
Tim Hardaway Jr.,SG,Atlanta Hawks,1.30452


### In 2015-16, what was the total payroll for all NBA teams combined?

In [24]:
sum(nba.column("'15-'16 SALARY"))

2116.1976390000013

### What's the largest salary in the NBA in 2015-16? Who earned it?

In [25]:
nba.column("'15-'16 SALARY").max()

25.0

In [26]:
nba.sort("'15-'16 SALARY", descending=True).show(5)

PLAYER,POSITION,TEAM,'15-'16 SALARY
Kobe Bryant,SF,Los Angeles Lakers,25.0
Joe Johnson,SF,Brooklyn Nets,24.8949
LeBron James,SF,Cleveland Cavaliers,22.9705
Carmelo Anthony,SF,New York Knicks,22.875
Dwight Howard,C,Houston Rockets,22.3594


### What are the optional arguments for `sort`?

* `descending=True`, sorts the column in descending order (default: ascending order).
* `distinct=True`, omits repeated values of the column, keeping only the first.

# Discussion Question

|Which line of code creates a table of the highest-paid players in each position?|
|---|
|`A.  nba.sort(3, descending=True).sort(1, distinct=False)`|
|`B.  nba.sort(3, descending=False).sort(1, distinct=True)`|
|`C. nba.sort(3, descending=False).sort(1, distinct=False)`|
|`D.   nba.sort(3, descending=True).sort(1, distinct=True)`|

In [30]:
nba.sort(3, descending=True).sort(1, distinct=True)

PLAYER,POSITION,TEAM,'15-'16 SALARY
Dwight Howard,C,Houston Rockets,22.3594
Chris Bosh,PF,Miami Heat,22.1927
Chris Paul,PG,Los Angeles Clippers,21.4687
Kobe Bryant,SF,Los Angeles Lakers,25.0
Dwyane Wade,SG,Miami Heat,20.0
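
To see why sorting by salary first and then sorting by position with `distinct=True` picks the top earner in each position, here is a rough plain-Python sketch. The rows are copied from tables shown in this lecture. `sort_rows` is a hypothetical helper written for illustration, not the library's implementation; it assumes that the second sort keeps tied rows in their existing order and that `distinct=True` keeps the first row seen for each value.

```python
# (player, position, salary) rows copied from tables shown in this lecture
rows = [
    ('Al Horford', 'C', 12.0),
    ('Jeff Teague', 'PG', 8.0),
    ('Kobe Bryant', 'SF', 25.0),
    ('Joe Johnson', 'SF', 24.8949),
    ('Dwight Howard', 'C', 22.3594),
    ('Chris Paul', 'PG', 21.4687),
]

def sort_rows(rows, index, descending=False, distinct=False):
    """Hypothetical helper: return a new list sorted by column `index`."""
    # sorted() is stable, so rows with equal keys keep their current order
    ordered = sorted(rows, key=lambda r: r[index], reverse=descending)
    if not distinct:
        return ordered
    seen = set()
    kept = []
    for row in ordered:
        # keep only the first row for each value in the sorted column
        if row[index] not in seen:
            seen.add(row[index])
            kept.append(row)
    return kept

by_salary = sort_rows(rows, 2, descending=True)
top_per_position = sort_rows(by_salary, 1, distinct=True)
for row in top_per_position:
    print(row)
```

Because the highest salary in each position comes first after the initial sort, it is the row that survives when repeats are dropped.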


# Digression: Lists

Let's make an array, but have it hold objects of different types:

In [31]:
data = make_array(1, 3.1415, 'n/a')
data

array(['1', '3.1415', 'n/a'], dtype='<U32')

Wait, what is the type of the first element?

In [32]:
type(data.item(0))

str

## Lists are generic sequences

- Arrays should only hold objects of one type.
- But `list`s can hold things of different types.

In [33]:
[1, 3.1415, 'hey']

[1, 3.1415, 'hey']

## Why use arrays instead of lists?

Big reason: arrays are fast.

In [34]:
#: an array and a list with the same data
n = 9_999_999
arr = np.arange(n)
lst = list(range(n))

In [35]:
%timeit arr.sum()

18.8 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [36]:
%timeit sum(lst)

1.16 s ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Creating tables from lists

Passing a list to `.with_column` converts it to an array implicitly.

In [None]:
data = Table().with_column('Stuff', [1, 3.1415, 'hey'])
data

But look at the types...

In [None]:
data.column('Stuff')

# Getting rows

In [37]:
#: let's have some data to work with
nba = Table.read_table('nba_salaries.csv')
nba

PLAYER,POSITION,TEAM,'15-'16 SALARY
Paul Millsap,PF,Atlanta Hawks,18.6717
Al Horford,C,Atlanta Hawks,12.0
Tiago Splitter,C,Atlanta Hawks,9.75625
Jeff Teague,PG,Atlanta Hawks,8.0
Kyle Korver,SG,Atlanta Hawks,5.74648
Thabo Sefolosha,SF,Atlanta Hawks,4.0
Mike Scott,PF,Atlanta Hawks,3.33333
Kent Bazemore,SF,Atlanta Hawks,2.0
Dennis Schroder,PG,Atlanta Hawks,1.7634
Tim Hardaway Jr.,SG,Atlanta Hawks,1.30452


## Another `.take()`...

- We know that `.select()` returns a table with the requested columns.
- To get a *table* with requested rows, use `.take()`
- As with `.item()`, counting starts with 0!

In [38]:
nba

PLAYER,POSITION,TEAM,'15-'16 SALARY
Paul Millsap,PF,Atlanta Hawks,18.6717
Al Horford,C,Atlanta Hawks,12.0
Tiago Splitter,C,Atlanta Hawks,9.75625
Jeff Teague,PG,Atlanta Hawks,8.0
Kyle Korver,SG,Atlanta Hawks,5.74648
Thabo Sefolosha,SF,Atlanta Hawks,4.0
Mike Scott,PF,Atlanta Hawks,3.33333
Kent Bazemore,SF,Atlanta Hawks,2.0
Dennis Schroder,PG,Atlanta Hawks,1.7634
Tim Hardaway Jr.,SG,Atlanta Hawks,1.30452


In [39]:
# get the first row
nba.take(0)

PLAYER,POSITION,TEAM,'15-'16 SALARY
Paul Millsap,PF,Atlanta Hawks,18.6717


## Multiple rows

Take multiple rows by providing a list:

In [40]:
nba.take([0, 5, 6])

PLAYER,POSITION,TEAM,'15-'16 SALARY
Paul Millsap,PF,Atlanta Hawks,18.6717
Thabo Sefolosha,SF,Atlanta Hawks,4.0
Mike Scott,PF,Atlanta Hawks,3.33333


In [41]:
#: indices have to be in a list!
nba.take(0, 5, 6)

TypeError: __call__() takes 2 positional arguments but 4 were given

## Discuss

For columns, we have:

- `.select()`: returns a table
- `.column()`: returns an array

For rows, we just have:

- `.take()`: returns a table

Why don't we have something that returns a row as an array?

# Retrieving rows conditionally

- We want to grab a subset of rows conditionally.
- Examples:
    - All NBA players who make over 20 million / year.
    - All point guards (PGs) and centers (Cs).
    - Any player with 20 or more letters in their name.

## Predicates

- A predicate is a function that returns `True` or `False`.
- Apply predicate to each item in a column.
- Keep entries for which it is `True`.
- Discard those for which it is `False`.

## The `.where()` method

- Applies a predicate over a column.
- Returns new table.
- Many predicates are provided.
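
The idea behind `.where()` can be sketched in plain Python: test each entry of one column and keep the rows that pass. The helper `where_rows` and the predicate `is_above_20` are hypothetical names made up for this illustration, and the rows are copied from the `nba` table; this is a model of the behavior described above, not the library's code.

```python
# (player, salary) rows copied from the nba table in this lecture
rows = [
    ('Paul Millsap', 18.6717),
    ('Kobe Bryant', 25.0),
    ('Al Horford', 12.0),
    ('LeBron James', 22.9705),
]

def is_above_20(salary):
    """A predicate: returns True or False for a single value."""
    return salary > 20

def where_rows(rows, index, predicate):
    """Hypothetical helper: keep rows whose entry in column `index` passes."""
    return [row for row in rows if predicate(row[index])]

high_earners = where_rows(rows, 1, is_above_20)
print(high_earners)

# a new list is returned; the original rows are untouched
print(len(rows))
```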

In [42]:
nba.where("'15-'16 SALARY", are.above(20))

PLAYER,POSITION,TEAM,'15-'16 SALARY
Joe Johnson,SF,Brooklyn Nets,24.8949
Derrick Rose,PG,Chicago Bulls,20.0931
LeBron James,SF,Cleveland Cavaliers,22.9705
Dwight Howard,C,Houston Rockets,22.3594
Chris Paul,PG,Los Angeles Clippers,21.4687
Kobe Bryant,SF,Los Angeles Lakers,25.0
Chris Bosh,PF,Miami Heat,22.1927
Carmelo Anthony,SF,New York Knicks,22.875
Kevin Durant,SF,Oklahoma City Thunder,20.1586


## Provided predicates

|Predicate|Description|
|---------|-----------|
|`are.above(y)`|Greater than y|
|`are.above_or_equal_to(y)`|Greater than or equal to y|
|`are.below(y)`|Less than y|
|`are.below_or_equal_to(y)`|Less than or equal to y|
|`are.between(y, z)`|Greater than or equal to y and less than z|
|`are.between_or_equal_to(y, z)`|Greater than or equal to y and less than or equal to z|
|`are.contained_in(superstring)`|A string that is part of the given superstring|
|`are.containing(substring)`|A string that contains within it the given substring|
|`are.equal_to(y)`|Equal to y|
|`are.not_above(y)`|Is not above y|
|`are.not_above_or_equal_to(y)`|Is neither above y nor equal to y|
|`are.not_below(y)`|Is not below y|
|`are.not_below_or_equal_to(y)`|Is neither below y nor equal to y|
|`are.not_between(y, z)`|Is less than y, or greater than or equal to z|
|`are.not_between_or_equal_to(y, z)`|Is less than y or greater than z|
|`are.not_contained_in(superstring)`|A string that is not contained within the superstring|
|`are.not_containing(substring)`|A string that does not contain substring|
|`are.not_equal_to(y)`|Is not equal to y|
|`are.not_strictly_between(y, z)`|Is equal to y or equal to z or less than y or greater than z|
|`are.strictly_between(y, z)`|Greater than y and less than z|

## Example

Get LeBron's row.

In [43]:
nba.where('PLAYER', are.equal_to('LeBron James'))

PLAYER,POSITION,TEAM,'15-'16 SALARY
LeBron James,SF,Cleveland Cavaliers,22.9705


`are.equal_to` is the default behavior:

In [44]:
nba.where('PLAYER', 'LeBron James')

PLAYER,POSITION,TEAM,'15-'16 SALARY
LeBron James,SF,Cleveland Cavaliers,22.9705


## Example

Grab all players with a salary between 5 and 6 million.

In [45]:
nba.where("'15-'16 SALARY", are.between(5, 6))

PLAYER,POSITION,TEAM,'15-'16 SALARY
Kyle Korver,SG,Atlanta Hawks,5.74648
Jonas Jerebko,PF,Boston Celtics,5.0
Courtney Lee,SG,Charlotte Hornets,5.675
Nikola Mirotic,PF,Chicago Bulls,5.54373
Deron Williams,PG,Dallas Mavericks,5.37897
Zaza Pachulia,C,Dallas Mavericks,5.2
JJ Hickson,C,Denver Nuggets,5.6135
Shaun Livingston,PG,Golden State Warriors,5.54373
Chase Budinger,SF,Indiana Pacers,5.0
Jamal Crawford,SG,Los Angeles Clippers,5.675


## Example

Grab all of the Golden State Warriors players.

In [48]:
nba.where('TEAM', are.containing('Golden State')).show(15)

PLAYER,POSITION,TEAM,'15-'16 SALARY
Klay Thompson,SG,Golden State Warriors,15.501
Draymond Green,PF,Golden State Warriors,14.2609
Andrew Bogut,C,Golden State Warriors,13.8
Andre Iguodala,SF,Golden State Warriors,11.7105
Stephen Curry,PG,Golden State Warriors,11.3708
Jason Thompson,PF,Golden State Warriors,7.00847
Shaun Livingston,PG,Golden State Warriors,5.54373
Harrison Barnes,SF,Golden State Warriors,3.8734
Marreese Speights,C,Golden State Warriors,3.815
Leandro Barbosa,SG,Golden State Warriors,2.5


In [47]:
are.containing('Golden State')

<datascience.predicates._combinable at 0x29b0bfd5048>

## Example

Find all teammates of LeBron James.

In [None]:
lebron = nba.where('PLAYER', are.containing('LeBron'))
team = lebron.column('TEAM').item(0)
nba.where('TEAM', are.equal_to(team))

## Predicates are functions!

In [None]:
# this creates a new function, f
f = are.above(30)

In [None]:
f(31)

In [None]:
f(29)
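
A function that builds and returns another function is ordinary Python. The sketch below defines a hypothetical `above` of our own to mimic the behavior described for `are.above`; the real predicate objects may carry extra machinery, so treat this only as a model.

```python
def above(y):
    """Return a new function that tests whether its argument exceeds y."""
    def predicate(x):
        return x > y
    return predicate

# calling above(30) does not test anything yet; it creates a function
f = above(30)

print(f(31))
print(f(29))
```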

## Looking ahead: creating own predicates

In Python, as in most languages, you can create your own functions.

In [None]:
#: we can define our own predicates...
def name_is_20_or_more_letters(name):
    return len(name) >= 20

In [None]:
nba.where('PLAYER', name_is_20_or_more_letters)

# Discussion Question

`.with_row()` works like `.with_column()`...

The table `nba` has columns `PLAYER`, `POSITION`, `TEAM`, `SALARY`.

What is the output when we execute a cell containing these two lines of code?

```
nba.with_row(['Jazz Bear', 'Mascot', 'Utah Jazz', 100])
nba.where('PLAYER', are.containing('Bear'))
```

* A. A table with one row for Jazz Bear
* B. An empty table with no rows
* C. An error message

# Practice: Census data

- Every ten years, the U.S. Census Bureau counts the number of people in the U.S.
- In other years, the bureau *estimates* the population
- Data is published online

## What does the file look like?

In [None]:
# prints the first few lines of the file
!head census.csv

## Read the data from disk

In [None]:
#: the usual code...
census = Table.read_table('census.csv')
census

## Or, read the data from a URL directly

In [None]:
#: the url to the data file
url = 'http://inferentialthinking.com/notebooks/nc-est2015-agesex-res.csv'

In [None]:
# download the data from the World Wide Web
census = Table.read_table(url)
census

## What do we have?

A description of the dataset is available at [census.gov](https://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.pdf).

Unfortunately...

<center>
<img src="./moved.png" width=50%>
</center>

## What are the column labels?

In [None]:
census.labels

## What values occur in each column?

Use `np.unique` to get the unique values.

In [None]:
# what ages occur in the data?
np.unique(census.column('AGE'))

Wait, `999`?

## What is `999` used for?

In [None]:
census.where('AGE', 999)

- It looks like `999` means *all ages together*.
- Similarly, a `SEX` of `0` means *all sexes together*.

## Discuss

Using the data alone, how might we make an educated guess as to which value of `SEX` means "male" and which means "female"?

In [None]:
census.where('AGE', 87)

- Women tend to live longer. This suggests that `1` is "male" and `2` is "female".

## Analyzing population trends

Let's look at how the population changed between 2010 and 2015.

In [None]:
# we only need a few columns
us_pop = census.select('SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2015')

In [None]:
# we don't like to type POPESTIMATE2010. something shorter...
us_pop = us_pop.relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2015', '2015')
us_pop

## Population changes

Compute absolute and percentage changes.

In [None]:
us_pop = us_pop.with_column(
    'CHANGE',
    us_pop.column('2015') - us_pop.column('2010')
)

In [None]:
us_pop = us_pop.with_column(
    'PCT CHANGE',
    us_pop.column('CHANGE') / us_pop.column('2010')
)
us_pop

## Display percentages nicely

In [None]:
us_pop.set_format('PCT CHANGE', PercentFormatter)

## Discussion Question

In [None]:
#: Given this data...
us_pop.where('AGE', 999).where('SEX', 0)

What does this code calculate?

`(321418820 / 309346863) ** (1/5) - 1`

|Responses|
|---------|
|A. The ratio of the population in 2015 to the population in 2010.|
|B. The percentage by which the population changed from 2010 to 2015.|
|C. The annual growth rate for the population from 2010 to 2015.|
|D. It doesn't compute anything meaningful.|
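
One way to probe the expression is to undo it: if a population grows by the same fraction `r` every year for five years, it ends up multiplied by `(1 + r) ** 5`. The sketch below uses the two totals from the expression; the names `pop_2010`, `pop_2015`, `r`, and `projected` are introduced here for illustration.

```python
# the two totals that appear in the expression above
pop_2010 = 309346863
pop_2015 = 321418820

# the expression from the question
r = (pop_2015 / pop_2010) ** (1 / 5) - 1
print(f"{r:.4%}")

# growing the 2010 total by r, five years in a row, recovers the 2015 total
projected = pop_2010 * (1 + r) ** 5
print(round(projected))
```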

## What age group(s) grew the most in size?

1. Any guesses?
2. How could we find out?

In [None]:
us_pop.sort('CHANGE', descending=True)

## Why?

In [None]:
2010 - 68

In [None]:
2015 - 68

The post-WWII baby boom.

## How does female:male ratio change with age?

General approach:

1. Keep data for only one year (say, 2015).
2. Make a table of females and a table of males.
3. Divide # of females at age by # of males at age

## 1. Keep data for only 2015

In [None]:
us_pop_2015 = us_pop.select('SEX', 'AGE', '2015')
us_pop_2015

## 2. Make a table of females and a table of males

In [None]:
females = us_pop_2015.where('SEX', 2).where('AGE', are.not_equal_to(999))

In [None]:
males = us_pop_2015.where('SEX', 1).where('AGE', are.not_equal_to(999))

## 3. Divide # of females at age by # of males at same age

In [None]:
# we have to "align" the data first (if it isn't already aligned)
females = females.sort('AGE')
males = males.sort('AGE')

In [None]:
ratios = Table().with_columns(
    'AGE', females.column('AGE'),
    'F:M RATIO', females.column('2015') / males.column('2015')
)
ratios
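
Dividing one column by another pairs entries by position, which is why both tables are sorted by `AGE` first. The plain-Python sketch below uses small made-up counts (not census figures) and hypothetical variable names to show what the alignment step buys us.

```python
# made-up (age, count) rows for illustration only; these are not census figures
females = [(2, 90), (0, 100), (1, 110)]
males = [(1, 100), (2, 100), (0, 125)]

# sort both by age so that position i refers to the same age in each list
females = sorted(females)
males = sorted(males)

ratios = []
for (f_age, f_count), (m_age, m_count) in zip(females, males):
    # guard against dividing counts that belong to different ages
    assert f_age == m_age, "rows are not aligned by age"
    ratios.append((f_age, f_count / m_count))

print(ratios)
```

Without the two `sorted` calls, the loop would pair age 2 with age 1 and the check would fail, which is the mistake the sort is there to prevent.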

## Visualize

In [None]:
#: a few new imports
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
ratios.plot('AGE')