# Lecture 4 - Selecting Rows

### Announcements

- Lab 02 due tonight.
- HW 02 due Sunday.

In [None]:
from datascience import *
import numpy as np

<center><img src="summary1.png"  width="800"/></center>

# Sorting Tables

* The `sort` method creates a new table with the same rows in a different order (the original table is unaffected).

* The `show` method displays the first rows of a table: `show(n)` displays the first `n` rows.

In [None]:
nba = Table.read_table('nba_salaries.csv') # Larger tables have rows omitted
nba 

# About the data

<center><img src="nba_description.png"  width="800"/></center>

### In 2015-16, what was the total payroll for all NBA teams combined?

### What's the largest salary in the NBA in 2015-16? Who earned it?

### How about the top 5?

### What are the optional arguments for `sort`?
* `descending=True` sorts the rows in descending order of the column (default: ascending order).
* `distinct=True` omits rows with repeated values in the column, keeping only the first.

### What does the code below do?

In [None]:
nba.sort(3, descending=True).sort(1, distinct=True)

<center><img src="q7.png"  width="1000"/></center>

In [None]:
nba.sort(3, descending=True).sort(1, distinct=True)

# Digression: Lists

What happens when you make an array with objects of different types?

In [None]:
data = make_array(1, 3.1415, 'n/a')
data

Wait, what is the type of the first element?

In [None]:
type(data.item(0))

## Lists are generic sequences

- Arrays should only hold objects of one type.
- But `list`s can hold objects of different types.

In [None]:
[1, 3.1415, 'hey']
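
A small sketch of the difference, using NumPy's `np.array` directly rather than `make_array`. The variable names are just for illustration.

```python
import numpy as np

mixed_list = [1, 3.1415, 'hey']
mixed_array = np.array(mixed_list)

# A list keeps each element's own type
list_types = [type(x).__name__ for x in mixed_list]

# The array converts everything to a single common type (here, strings)
array_types = [type(x).__name__ for x in mixed_array.tolist()]

print(list_types)
print(array_types)
```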

## If lists are more general, why use arrays at all?

Big reason: arrays are fast.

In [None]:
# an array and a list with the same data
n = 9_999_999
arr = np.arange(n)
lst = list(range(n))

In [None]:
%timeit arr.sum() # work with array

In [None]:
%timeit sum(lst) # work with list

## Creating tables from lists

Passing a list to `.with_column` converts it to an array implicitly.

In [None]:
data = Table().with_column('Stuff', [1, 3.1415, 'hey'])
data

But look at the types...

In [None]:
data.column('Stuff')

# Getting rows

In [None]:
nba = Table.read_table('nba_salaries.csv')
nba

## `.select()` columns, `.take()` rows

- We know that `.select()` returns a table with the requested columns
- To get a *table* with requested rows, use `.take()`
- As with `.item()`, counting starts with 0

In [None]:
# get the first row
nba.take(0)

## Multiple rows

Take multiple rows by providing a list of row indices:

In [None]:
nba.take([0,5,6])

In [None]:
#: this raises an error, since indices have to be in a list or array
nba.take(0, 5, 6)

## Discuss

For columns, we have:

- `.select()`: returns a table
- `.column()`: returns an array

For rows, we just have:

- `.take()`: returns a table

Why don't we have something that returns a row as an array?

# Retrieving a row 

In [None]:
nba.row(0)

In [None]:
# compare to
nba.take(0)

# Retrieving rows conditionally

- We often want to grab a subset of rows *conditionally*, when some condition is satisfied.
- Examples:
    - All NBA players who make over 20 million / year.
    - All point guards (PGs) and centers (Cs).
    - Any player with 20 or more letters in their name.

## Predicates

- A predicate is a function that returns `True` or `False`.
- We use predicates as conditions, keeping only rows that satisfy the condition.
- Apply a predicate to each item in a column.
    - Keep entries for which it is `True`.
    - Discard those for which it is `False`.

## The `.where()` method

- Applies a predicate to a column.
- Returns a new table containing only the rows where the predicate is `True`.
- Many predicates are provided.

In [None]:
nba.where("'15-'16 SALARY", are.above(20))
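
Conceptually, `.where()` checks each entry of the chosen column and keeps the rows that pass. The plain-Python sketch below illustrates that idea only. It is not how the `datascience` library is implemented, and the helper `where_rows` and the sample rows are hypothetical.

```python
# Hypothetical rows: (player, salary in millions)
rows = [
    ('Player A', 25.0),
    ('Player B', 4.5),
    ('Player C', 21.3),
]

def where_rows(rows, column_index, predicate):
    """Keep only the rows whose entry at column_index satisfies the predicate."""
    return [row for row in rows if predicate(row[column_index])]

def above_20(salary):
    """A predicate: True when the salary is greater than 20."""
    return salary > 20

high_earners = where_rows(rows, 1, above_20)
print(high_earners)
```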

## Provided predicates

|Predicate|Description|
|---------|-----------|
|`are.above(y)`|Greater than y|
|`are.above_or_equal_to(y)`|Greater than or equal to y|
|`are.below(y)`|Less than y|
|`are.below_or_equal_to(y)`|Less than or equal to y|
|`are.between(y, z)`|Greater than or equal to y and less than z|
|`are.between_or_equal_to(y, z)`|Greater than or equal to y and less than or equal to z|
|`are.contained_in(superstring)`|A string that is part of the given superstring|
|`are.containing(substring)`|A string that contains within it the given substring|
|`are.equal_to(y)`|Equal to y|
|`are.not_above(y)`|Is not above y|
|`are.not_above_or_equal_to(y)`|Is neither above y nor equal to y|
|`are.not_below(y)`|Is not below y|
|`are.not_below_or_equal_to(y)`|Is neither below y nor equal to y|
|`are.not_between(y, z)`|Is less than y or greater than or equal to z|
|`are.not_between_or_equal_to(y, z)`|Is less than y or greater than z|
|`are.not_contained_in(superstring)`|A string that is not contained within the superstring|
|`are.not_containing(substring)`|A string that does not contain the substring|
|`are.not_equal_to(y)`|Is not equal to y|
|`are.not_strictly_between(y, z)`|Is equal to y or equal to z or less than y or greater than z|
|`are.strictly_between(y, z)`|Greater than y and less than z|

## Example

Get LeBron's row.

In [None]:
nba.where('PLAYER', are.equal_to('LeBron James'))

`are.equal_to` is the default behavior:

In [None]:
nba.where('PLAYER', 'LeBron James')

## Example

Grab all players with a salary between 5 and 6 million.

In [None]:
nba.where("'15-'16 SALARY", are.between(5, 6))

## Example

Grab all of the players from Los Angeles.

In [None]:
nba.where('TEAM', are.containing('Los Angeles')).show(15)

## Example

Find all teammates of LeBron James.

In [None]:
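lebron = nba.where('PLAYER', 'LeBron James')  # make table with just LeBron's row
team = lebron.column('TEAM').item(0)  # extract his team name
nba.where('TEAM', team)  # make table with just his teammates (this includes LeBron himself)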

## Example

Create an array containing the names of all point guards (PG) who made more than 15 million dollars.


## Predicates are functions!

In [None]:
# this creates a new function, f
f = are.above(30)

In [None]:
f(31)

In [None]:
f(29)
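
To see how a call like `are.above(30)` can produce a function, here is a plain-Python sketch of the same idea. The helper `above` is hypothetical and only imitates the behavior. The library's own implementation may differ.

```python
def above(threshold):
    """Return a predicate that is True for values greater than threshold."""
    def predicate(value):
        return value > threshold
    return predicate

g = above(30)               # g is itself a function
results = [g(31), g(29)]    # call it like any other function
print(results)
```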

## Looking ahead: creating your own predicates

In Python, as in most languages, you can create your own functions.

In [None]:
#: we can define our own predicates...
def name_is_20_or_more_letters(name):
    return len(name) >= 20

In [None]:
nba.where('PLAYER', name_is_20_or_more_letters)

# Discussion Question

`.with_row()` works like `.with_column()`, except you give a *list* of row entries. 

The table `nba` has columns `PLAYER`, `POSITION`, `TEAM`, `SALARY`. What is the output when we execute a cell containing these two lines of code?

```
nba.with_row(['Jazz Bear', 'Mascot', 'Utah Jazz', 100])
nba.where('PLAYER', are.containing('Bear'))
```

A. A table with one row for Jazz Bear  
B. An empty table with no rows  
C. An error message

In [None]:
nba.with_row(['Jazz Bear', 'Mascot', 'Utah Jazz', 100])
nba.where('PLAYER', are.containing('Bear'))

<center><img src="summary2.png"  width="800"/></center>

# Practice: Census data

- Every ten years, the U.S. Census Bureau counts the number of people in the U.S.
- In other years, the bureau *estimates* the population.
- The data is published online.

In [None]:
census = Table.read_table('census.csv')
census

## What do we have?

A description of the dataset is available at [census.gov](https://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.pdf).

Unfortunately...

<center>
<img src="./moved.png" width=50%>
</center>

## What are the column labels?

In [None]:
census.labels

## What values occur in each column?

Use `np.unique` to get the unique values.

In [None]:
# what ages occur in the data?
np.unique(census.column('AGE'))

Wait, `999`?

## What is `999` used for?

In [None]:
census.where('AGE', 999)

- It looks like `999` means *all ages together*.
- Similarly, a `SEX` of `0` means *all sexes together*.

## Discuss

Using the data alone, how might we make an educated guess as to which value of `SEX` means "male" and which means "female"?

In [None]:
census.where('AGE', 87)

- Women tend to live longer, so at an advanced age like 87 the larger group is probably female. This suggests that `1` is "male" and `2` is "female".

## Analyzing population trends

Let's look at how the population changed between 2010 and 2015.

In [None]:
# we only need a few columns
us_pop = census.select('SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2015')

In [None]:
# we don't like to type POPESTIMATE2010. Relabel to something shorter...
us_pop = us_pop.relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2015', '2015')
us_pop

## Population changes

Compute absolute and percentage changes.

In [None]:
us_pop = us_pop.with_column(
    'CHANGE',
    us_pop.column('2015') - us_pop.column('2010')
)

In [None]:
us_pop = us_pop.with_column(
    'PCT CHANGE',
    us_pop.column('CHANGE') / us_pop.column('2010')
)
us_pop
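
Arithmetic between two arrays is elementwise, which is why one subtraction and one division handle every row at once. A sketch with made-up numbers (not census values):

```python
import numpy as np

# Hypothetical population counts for three age groups
pop_2010 = np.array([100, 200, 400])
pop_2015 = np.array([110, 190, 400])

change = pop_2015 - pop_2010      # absolute change per row
pct_change = change / pop_2010    # change as a fraction of the 2010 value

print(change)
print(pct_change)
```

Note that the result is a fraction, so a value of 0.1 corresponds to a 10% increase.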

## Display percentages nicely

In [None]:
us_pop.set_format('PCT CHANGE', PercentFormatter)

## What age group(s) grew the most in size?

1. Any guesses?
2. How could we find out?

In [None]:
us_pop.sort('CHANGE', descending=True)

## Why?

In [None]:
2010 - 68

In [None]:
2015 - 68

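Those who were 68 in 2015 were born around 1947, during the post-WWII baby boom, while those who were 68 in 2010 were born around 1942, before it began.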

## How does female:male ratio change with age?

General approach:  
1. Keep data for only one year (say, 2015).
2. Make a table of females and a table of males.
3. Divide # of females at each age by # of males at that age.

## 1. Keep data for only 2015

In [None]:
us_pop_2015 = us_pop.select('SEX', 'AGE', '2015')
us_pop_2015

## 2. Make a table of females and a table of males

In [None]:
females = us_pop_2015.where('SEX', 2).where('AGE', are.not_equal_to(999))
females

In [None]:
males = us_pop_2015.where('SEX', 1).where('AGE', are.not_equal_to(999))
males

## 3. Divide # of females at each age by # of males at that age

In [None]:
# we should "align" the data first to make sure rows are in the same order
females = females.sort('AGE')
males = males.sort('AGE')
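
Why alignment matters: dividing two arrays pairs up entries by position, not by age. The sketch below uses made-up counts and NumPy directly (rather than tables) to show how misordered data would give wrong ratios.

```python
import numpy as np

# Hypothetical counts; the male data is deliberately out of age order
female_ages = np.array([0, 1, 2])
female_counts = np.array([95, 96, 97])
male_ages = np.array([2, 0, 1])
male_counts = np.array([102, 100, 101])

# Reorder both sets of counts by age before dividing
female_sorted = female_counts[np.argsort(female_ages)]
male_sorted = male_counts[np.argsort(male_ages)]

aligned_ratio = female_sorted / male_sorted
misaligned_ratio = female_counts / male_counts   # pairs age 0 with age 2, and so on

print(aligned_ratio)
print(misaligned_ratio)
```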

In [None]:
ratios = Table().with_columns(
    'AGE', females.column('AGE'),
    'F:M RATIO', females.column('2015') / males.column('2015')
)
ratios

In [None]:
ratios.sort('AGE')

In [None]:
ratios.sort('AGE', descending=True)

## Visualize

In [None]:
#: a few new imports for displaying nice plots
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
ratios.plot('AGE')

## Why? Our data analysis raises questions that are the topic of research.

[Why do women live longer than men?](https://ourworldindata.org/why-do-women-live-longer-than-men)

[Why are there more baby boys than baby girls?](https://www.pewresearch.org/fact-tank/2013/09/24/the-odds-that-you-will-give-birth-to-a-boy-or-girl-depend-on-where-in-the-world-you-live/)