# Lecture 4 - Selecting Rows

### Announcements

- Lab 02 due tomorrow
- HW 02 out tonight.  Due Sunday.
- Project 01 out Friday (hopefully)
- No class on Monday due to holiday, but Lab 03 will release
- Waitlist issues should be sorted

# The datascience module
* see documentation here: http://data8.org/datascience/tables.html
* similar to Pandas dataframes, R data.frame, Spark dataframes

## Why not use a "real" library for tables?
* Production libraries contain ugly warts and bloat.
* The datascience module highlights only what's important.
* It's easy to transition to production libraries (do it, if you're motivated!).

# Anatomy of a table

In [None]:
#: the usual imports
from datascience import *
import numpy as np

In [None]:
# read './minard.csv'
minard = Table.read_table('minard.csv')
minard

### Shape of a table:
* number of columns,
* number of rows

In [None]:
minard.num_columns

In [None]:
minard.num_rows

### labels and relabeling columns
* `.labels` and `.relabeled(old_name, new_name)`
* `.relabeled` returns a new table (doesn't change the current one)

In [None]:
minard.labels

In [None]:
# relabeled returns a new table, so reassign to keep the new label
minard = minard.relabeled('City', 'City Name')
minard

### Selecting columns and table elements
* `.column()` takes a column name/index; returns an array.
* `.select()` takes column(s) (name/index); returns a table.

In [None]:
# access a column (array)
minard.column('City Name')

In [None]:
# access an element of the table
minard.column('City Name').item(0)

In [None]:
minard.select('City Name', 'Latitude')

In [None]:
lat_long_cols = ['Latitude', 'Longitude']
minard.select(lat_long_cols)

### Adding columns: percentage of surviving troops at step k
* use `.with_column(col_name, array)`
* use `PercentFormatter` using `.set_format(col, formatter)`

In [None]:
initial = minard.column('Survivors').item(0)
minard = minard.with_column(
    'Percent Surviving', minard.column('Survivors')/initial
)
minard

In [None]:
# format the percent column
minard.set_format('Percent Surviving', PercentFormatter)

### Dropping columns
* `.drop(cols)`

In [None]:
minard.drop(lat_long_cols)

In [None]:
minard

### Discussion Question

|How would you calculate the average of the numbers in the `Survivors` column of `minard`?|
|---|
|`A. sum(minard.select('Survivors')) / minard.num_rows`|
|`B. sum(minard.column('Survivors')) / minard.num_rows`|
|`C.                                Both A and B work.`|
|`D.                             Neither A nor B work.`|

# Summary of Table methods

Description|datascience module methods
---|---
Creating and extending tables:| `Table.read_table` and `Table().with_columns`
Finding the size| `num_rows` and `num_columns`
Referring to columns: labels, relabeling, and indices | `labels` and `relabeled`; column indices start at 0
Accessing data in a column|`column` takes a label or index and returns an array
Using array methods to work with data in columns|`item, sum, min, max`, and so on
Creating new tables containing some of the original columns:| `select, drop`

# Sorting Tables

* The `sort` method creates a new table with the same rows in a different order (the original table is unaffected).

* The `show` method displays the first few rows of a table (pass the number of rows to display).

In [None]:
# Larger tables have rows omitted
# read in ./nba_salaries.csv
nba = Table.read_table('nba_salaries.csv')
nba

### In 2015-16, what was the total payroll for all NBA teams combined?

In [None]:
sum(nba.column("'15-'16 SALARY"))

### What's the largest salary in the NBA in 2015-16? Who earned it?

In [None]:
nba.column("'15-'16 SALARY").max()

In [None]:
nba.sort("'15-'16 SALARY", descending=True).show(5)

### What are the optional arguments for `sort`?

* `descending=True`, sorts the column in descending order (default: ascending order).
* `distinct=True`, omits repeated values of the column, keeping only the first.
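
To make `distinct=True` concrete, here is a plain-Python sketch of the idea on made-up rows: sort, then keep only the first row seen for each value. The helper name `sort_distinct` and the toy data are assumptions for illustration; this is not how the datascience module implements `sort`.

```python
def sort_distinct(rows, key, descending=False):
    """Sort rows (dicts) by `key`, keeping only the first row for each value."""
    # sorted() is stable: rows with equal values keep their original order
    ordered = sorted(rows, key=lambda r: r[key], reverse=descending)
    seen = set()
    result = []
    for row in ordered:
        if row[key] not in seen:
            seen.add(row[key])
            result.append(row)
    return result

# hypothetical toy data, not from any lecture dataset
toy = [
    {'name': 'A', 'score': 3},
    {'name': 'B', 'score': 1},
    {'name': 'C', 'score': 3},
    {'name': 'D', 'score': 2},
]
distinct_scores = sort_distinct(toy, 'score')
print([r['name'] for r in distinct_scores])
```

Rows `A` and `C` share a score, so only `A`, which comes first, survives.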

# Discussion Question

|Which line of code creates a table of the highest-paid players in each position?|
|---|
|`A.  nba.sort(3, descending=True).sort(1, distinct=False)`|
|`B.  nba.sort(3, descending=False).sort(1, distinct=True)`|
|`C. nba.sort(3, descending=False).sort(1, distinct=False)`|
|`D.   nba.sort(3, descending=True).sort(1, distinct=True)`|

# Digression: Lists

Let's make an array, but have it hold objects of different types:

In [None]:
data = make_array(1, 3.1415, 'n/a')
data

Wait, what is the type of the first element?

In [None]:
type(data.item(0))

## Lists are generic sequences

- Arrays should only hold objects of one type.
- But `list`s can hold things of different types.

In [None]:
[1, 3.1415, 'hey']

## Why use arrays instead of lists?

Big reason: arrays are fast.

In [None]:
#: an array and a list with the same data
n = 9_999_999
arr = np.arange(n)
lst = list(range(n))

In [None]:
%timeit arr.sum()

In [None]:
%timeit sum(lst)

## Creating tables from lists

Passing a list to `.with_column` converts it to an array implicitly.

In [None]:
data = Table().with_column('Stuff', [1, 3.1415, 'hey'])
data

But look at the types...

In [None]:
data.column('Stuff')

# Getting rows

In [None]:
#: let's have some data to work with
nba = Table.read_table('nba_salaries.csv')
nba

## Another `.take()`...

- We know that `.select()` returns a table with the requested columns.
- To get a *table* with requested rows, use `.take()`
- As with `.item()`, counting starts with 0!

In [None]:
nba

In [None]:
# get the first row (as a table)
nba.take(0)

## Multiple rows

Take multiple rows by providing a list:

In [None]:
nba.take([0, 5, 6])

In [None]:
#: this raises an error, because the indices have to be in a list!
nba.take(0, 5, 6)

## Discuss

For columns, we have:

- `.select()`: returns a table
- `.column()`: returns an array

For rows, we just have:

- `.take()`: returns a table

Why don't we have something that returns a row as an array?

# Retrieving rows conditionally

- We want to grab a subset of rows conditionally.
- Examples:
    - All NBA players who make over 20 million / year.
    - All point guards (PGs) and centers (Cs).
    - Any player with more than 20 letters in their name

## Predicates

- A predicate is a function that returns `True` or `False`.
- Apply predicate to each item in a column.
- Keep entries for which it is `True`.
- Discard those for which it is `False`.
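
The steps above can be sketched in plain Python. This is only an illustration of the idea, with rows stored as dictionaries; the function name `where`, the predicate `is_over_20`, and the player data are all made up and are not the datascience module's own code.

```python
def where(rows, column, predicate):
    """Keep the rows whose value in `column` makes `predicate` return True."""
    return [row for row in rows if predicate(row[column])]

def is_over_20(salary):
    """A predicate: returns True or False for a single value."""
    return salary > 20

# hypothetical rows, salaries in millions
players = [
    {'PLAYER': 'Player X', 'SALARY': 25.0},
    {'PLAYER': 'Player Y', 'SALARY': 4.5},
    {'PLAYER': 'Player Z', 'SALARY': 21.3},
]
high_paid = where(players, 'SALARY', is_over_20)
print([p['PLAYER'] for p in high_paid])
```

The original list `players` is left untouched; the filtered rows come back as a new list.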

## The `.where()` method

- Applies a predicate over a column.
- Returns new table.
- Many predicates are provided.

In [None]:
nba.where("'15-'16 SALARY", are.above(20))

## Provided predicates

|Predicate|Description|
|---------|-----------|
|`are.above(y)`|Greater than y|
|`are.above_or_equal_to(y)`|Greater than or equal to y|
|`are.below(y)`|Less than y|
|`are.below_or_equal_to(y)`|Less than or equal to y|
|`are.between(y, z)`|Greater than or equal to y and less than z|
|`are.between_or_equal_to(y, z)`|Greater than or equal to y and less than or equal to z|
|`are.contained_in(superstring)`|A string that is part of the given superstring|
|`are.containing(substring)`|A string that contains within it the given substring|
|`are.equal_to(y)`|Equal to y|
|`are.not_above(y)`|Is not above y|
|`are.not_above_or_equal_to(y)`|Is neither above y nor equal to y|
|`are.not_below(y)`|Is not below y|
|`are.not_below_or_equal_to(y)`|Is neither below y nor equal to y|
|`are.not_between(y, z)`|Is less than y or greater than or equal to z|
|`are.not_between_or_equal_to(y, z)`|Is less than y or greater than z|
|`are.not_contained_in(superstring)`|A string that is not contained within the superstring|
|`are.not_containing(substring)`|A string that does not contain the substring|
|`are.not_equal_to(y)`|Is not equal to y|
|`are.not_strictly_between(y, z)`|Is equal to y or equal to z or less than y or greater than z|
|`are.strictly_between(y, z)`|Greater than y and less than z|

## Example

Get LeBron's row.

In [None]:
nba.where('PLAYER', are.equal_to('LeBron James'))

`are.equal_to` is the default behavior:

In [None]:
nba.where('PLAYER', 'LeBron James')

## Example

Grab all players with a salary between 5 and 6 million.

In [None]:
nba.where("'15-'16 SALARY", are.between(5, 6))

## Example

Grab all of the Golden State Warriors players.

In [None]:
nba.where('TEAM', are.containing('Golden State'))

## Example

Find all teammates of LeBron James.

In [None]:
lebron = nba.where('PLAYER', are.containing('LeBron'))
team = lebron.column('TEAM').item(0)
nba.where('TEAM', team)

## Predicates are functions!

In [None]:
# this creates a new function, f
f = are.above(30)

In [None]:
f(31)

In [None]:
f(29)
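
One way a predicate like this could be built is with a function that returns another function. The standalone `above` below is a hypothetical sketch of that pattern, not the datascience module's actual implementation of `are.above`.

```python
def above(y):
    """Return a new function that tests whether its argument is greater than y."""
    def predicate(x):
        return x > y
    return predicate

g = above(30)      # g remembers y = 30
print(g(31))       # True
print(g(29))       # False
print(g(30))       # False: 30 is not strictly above 30
```

Each call to `above` produces a separate function, so `above(30)` and `above(20)` can be used side by side.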

## Looking ahead: creating your own predicates

In Python, as in most languages, you can create your own functions.

In [None]:
#: we can define our own predicates...
def name_is_20_or_more_letters(name):
    return len(name) >= 20

In [None]:
nba.where('PLAYER', name_is_20_or_more_letters)

# Discussion Question

`.with_row()` works like `.with_column()`...

The table `nba` has columns `PLAYER`, `POSITION`, `TEAM`, `SALARY`.

What is the output when we execute a cell containing these two lines of code?

```
nba.with_row(['Jazz Bear', 'Mascot', 'Utah Jazz', 100])
nba.where('PLAYER', are.containing('Bear'))
```

* A. A table with one row for Jazz Bear
* B. An empty table with no rows
* C. An error message

# Practice: Census data

- Every ten years, the U.S. Census Bureau counts the number of people in the U.S.
- In other years, the bureau *estimates* the population.
- Data is published online

## What does the file look like?

In [None]:
# prints the first few lines of the file
!head census.csv

## Read the data from disk

In [None]:
#: the usual code...
census = Table.read_table('census.csv')
census

## Or, read the data from a URL directly

In [None]:
#: the url to the data file
url = 'http://inferentialthinking.com/notebooks/nc-est2015-agesex-res.csv'

In [None]:
# download the data from the World Wide Web
census = Table.read_table(url)
census

## What do we have?

A description of the dataset is available at [census.gov](https://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.pdf).

Unfortunately...

<center>
<img src="./moved.png" width=50%>
</center>

## What are the column labels?

In [None]:
census.labels

## What values occur in each column?

Use `np.unique` to get the unique values.

In [None]:
# what ages occur in the data?
np.unique(census.column('AGE'))

Wait, `999`?

## What is `999` used for?

In [None]:
census.where('AGE', 999)

- It looks like `999` means *all ages together*.
- Similarly, a `SEX` of `0` means *all sexes together*.

## Discuss

Using the data alone, how might we make an educated guess as to which value of `SEX` means "male" and which means "female"?

In [None]:
census.where('AGE', 87)

- Women tend to live longer. This suggests that `1` is "male" and `2` is "female".

## Analyzing population trends

Let's look at how the population changed between 2010 and 2015.

In [None]:
# we only need a few columns
us_pop = census.select('SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2015')

In [None]:
# we don't like to type POPESTIMATE2010. something shorter...
us_pop = us_pop.relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2015', '2015')
us_pop

## Population changes

Compute absolute and percentage changes.

In [None]:
us_pop = us_pop.with_column(
    'CHANGE',
    us_pop.column('2015') - us_pop.column('2010')
)

In [None]:
us_pop = us_pop.with_column(
    'PCT CHANGE',
    us_pop.column('CHANGE') / us_pop.column('2010')
    )
us_pop

## Display percentages nicely

In [None]:
us_pop.set_format('PCT CHANGE', PercentFormatter)

## Discussion Question

In [None]:
#: Given this data...
us_pop.where('AGE', 999).where('SEX', 0)

What does this code calculate?

`(321418820 / 309346863) ** (1/5) - 1`

|Responses|
|---------|
|A. The ratio of the population in 2015 to the population in 2010.|
|B. The percentage by which the population changed from 2010 to 2015.|
|C. The annual growth rate for the population from 2010 to 2015.|
|D. It doesn't compute anything meaningful.|

## What age group(s) grew the most in size?

1. Any guesses?
2. How could we find out?

In [None]:
us_pop.sort('CHANGE', descending=True)

## Why?

In [None]:
2010 - 68

In [None]:
2015 - 68

The post-WWII baby boom.

## How does female:male ratio change with age?

General approach:

1. Keep data for only one year (say, 2015).
2. Make a table of females and a table of males.
3. Divide # of females at each age by # of males at the same age.

## 1. Keep data for only 2015

In [None]:
us_pop_2015 = us_pop.select('SEX', 'AGE', '2015')
us_pop_2015

## 2. Make a table of females and a table of males

In [None]:
females = us_pop_2015.where('SEX', 2).where('AGE', are.not_equal_to(999))

In [None]:
males = us_pop_2015.where('SEX', 1).where('AGE', are.not_equal_to(999))

## 3. Divide # of females at age by # of males at same age

In [None]:
# we have to "align" the data first (if it isn't already aligned)
females = females.sort('AGE')
males = males.sort('AGE')

In [None]:
ratios = Table().with_columns(
    'AGE', females.column('AGE'),
    'F:M RATIO', females.column('2015') / males.column('2015')
)
ratios
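
Dividing one column by another pairs values by position, not by age, which is why both tables are sorted by `AGE` first. A small plain-Python sketch with made-up counts (the names `toy_females` and `toy_males` are hypothetical) shows the same alignment step:

```python
# made-up (age, count) pairs, deliberately listed in different age orders
toy_females = [(2, 110), (0, 95), (1, 100)]
toy_males = [(0, 100), (1, 100), (2, 100)]

# sorting both by age lines up position i with the same age in each list
females_sorted = sorted(toy_females)
males_sorted = sorted(toy_males)

toy_ratios = [
    (age_f, f / m)
    for (age_f, f), (age_m, m) in zip(females_sorted, males_sorted)
]
print(toy_ratios)
```

Without the sort, the first ratio would divide the age-2 female count by the age-0 male count.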

## Visualize

In [None]:
#: a few new imports
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
ratios.plot('AGE')