In [1]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

# Tables Review
Recall that in the last lecture we talked about manipulating rows.

## Manipulating Rows
1. `t.sort(column)` sorts the rows by a particular column in increasing order
2. `t.take(row_numbers)` returns a table containing only the rows at the given indices
    * Each `row` has an index, starting at 0
3. `t.where(column, are.condition)` keeps only the rows whose value in `column` satisfies the condition
4. `t.where(column, value)` is a shortcut for `t.where(column, are.equal_to(value))`.
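
The row operations listed above can be mimicked with plain Python lists, which may help show what each method does. This is only a sketch: it does not use the `datascience` library, the rows are a few entries from the `nba` table used in the discussion question, and the salary threshold of 5 (million) is an assumption chosen so that the small sample still returns a row.

```python
# A few rows from the `nba` table: (NAME, POSITION, SALARY in millions)
rows = [
    ('Paul Millsap', 'PF', 18.6717),
    ('Al Horford', 'C', 12.0),
    ('Tiago Splitter', 'C', 9.75625),
    ('Jeff Teague', 'PG', 8.0),
    ('Kyle Korver', 'SG', 5.74648),
    ('Dennis Schroder', 'PG', 1.7634),
]

# Analogue of t.sort('SALARY'): increasing order of salary
by_salary = sorted(rows, key=lambda row: row[2])

# Analogue of t.take([0, 2]): keep the rows at indices 0 and 2
taken = [rows[i] for i in [0, 2]]

# Analogue of t.where('POSITION', 'PG').where('SALARY', are.above(5))
point_guards = [row for row in rows if row[1] == 'PG']
well_paid_pgs = [row for row in point_guards if row[2] > 5]
names = [row[0] for row in well_paid_pgs]
```

Each step builds a new list and leaves `rows` untouched, which matches how the table methods return new tables.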

## Discussion Question

In [2]:
nba = Table.read_table('nba_salaries.csv')

After the relabeling and `drop` steps below, the table `nba` has columns `NAME`, `POSITION`, and `SALARY`.

1. Create an array containing the names of all point guards (PG) who make more than 15 million dollars per year.

In [3]:
# Relabel the label '2015-2016 SALARY' to 'SALARY' and 'PLAYER' to 'NAME'
nba = nba.relabeled('2015-2016 SALARY', 'SALARY').relabeled('PLAYER', 'NAME')

In [4]:
nba = nba.drop('TEAM')
nba

NAME,POSITION,SALARY
Paul Millsap,PF,18.6717
Al Horford,C,12.0
Tiago Splitter,C,9.75625
Jeff Teague,PG,8.0
Kyle Korver,SG,5.74648
Thabo Sefolosha,SF,4.0
Mike Scott,PF,3.33333
Kent Bazemore,SF,2.0
Dennis Schroder,PG,1.7634
Tim Hardaway Jr.,SG,1.30452


In [5]:
nba.where('POSITION', 'PG').where('SALARY', are.above(15)).column('NAME')

array(['Derrick Rose', 'Kyrie Irving', 'Chris Paul', 'Russell Westbrook',
       'John Wall'], dtype='<U24')

2. After evaluating the expressions below, what would be the outcome?

In [6]:
nba.with_row(['Sam Lau', 'Mascot', 100])
nba.where('NAME', are.containing('Lau'))

NAME,POSITION,SALARY
Joffrey Lauvergne,C,1.70972


Notice that the row we added with `with_row` is missing! `with_row` returns a new table and leaves `nba` unchanged, so we need to assign the result back to `nba` (or to a new name) to keep the new row.
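
The same behavior can be reproduced with plain Python lists. In this sketch, `with_row` is a hypothetical helper written for illustration (it is not the `datascience` method), and the two starting rows are taken from the outputs shown earlier.

```python
def with_row(rows, new_row):
    """Hypothetical helper: return a NEW list with new_row added; rows is untouched."""
    return rows + [new_row]

# (NAME, POSITION, SALARY) rows taken from the outputs above
nba_rows = [('Joffrey Lauvergne', 'C', 1.70972), ('Jeff Teague', 'PG', 8.0)]

# The result is discarded here, so nba_rows still has no 'Sam Lau' row
with_row(nba_rows, ('Sam Lau', 'Mascot', 100))
matches_before = [row[0] for row in nba_rows if 'Lau' in row[0]]

# Assigning the result back keeps the new row
nba_rows = with_row(nba_rows, ('Sam Lau', 'Mascot', 100))
matches_after = [row[0] for row in nba_rows if 'Lau' in row[0]]
```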

# Census Data

## The Decennial Census
* Every 10 years, the Census Bureau counts how many people there are in the U.S. 
    * In the years between counts, the Bureau does not know exactly how many people there are
    * So the Bureau estimates the population for each year
    
Article 1, Section 2 of the Constitution states: "Representatives and direct Taxes shall be apportioned among the several States ... according to their respective Numbers ...".

Thus, the population count is needed to apportion direct taxes and seats in the House of Representatives among the states.

## Analyzing Census Data
Analyzing population data can reveal interesting features and trends in the population.
We will load the census `csv` data into the name `full_census_table`.

In [7]:
full_census_table = Table.read_table('census.csv')
full_census_table

SEX,AGE,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
0,0,3944153,3944160,3951330,3963087,3926540,3931141,3949775,3978038
0,1,3978070,3978090,3957888,3966551,3977939,3942872,3949776,3968564
0,2,4096929,4096939,4090862,3971565,3980095,3992720,3959664,3966583
0,3,4119040,4119051,4111920,4102470,3983157,3992734,4007079,3974061
0,4,4063170,4063186,4077551,4122294,4112849,3994449,4005716,4020035
0,5,4056858,4056872,4064653,4087709,4132242,4123626,4006900,4018158
0,6,4066381,4066412,4073013,4074993,4097605,4142916,4135930,4019207
0,7,4030579,4030594,4043046,4083225,4084913,4108349,4155326,4148360
0,8,4046486,4046497,4025604,4053203,4093177,4095711,4120903,4167887
0,9,4148353,4148369,4125415,4035710,4063152,4104072,4108349,4133564


Above, the column labels are not very descriptive. We can look at the [PDF](https://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.pdf) on the Census Bureau's website for an explanation of each label.

## Census Table Description

Values have column-dependent interpretations
* The `SEX` column: `1` = Male, `2` = Female
* The `POPESTIMATE2010` column: estimated population on July 1, 2010

In this table, some rows are sums of other rows
* The `SEX` column: 0 = total of Male and Female
* The `AGE` column: 999 is total of all ages
    
Why use numeric codes such as `1` and `2` rather than the strings `Male` and `Female`?
* Numeric codes are often used for storage efficiency
* A number takes less space to store than a string of text

Be careful! All values within a column have the same type, but they are not necessarily comparable.
* e.g. comparing `AGE` 12 with `AGE` 999 is not meaningful, because 999 is a code for "all ages", not an actual age
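
To make the pitfall concrete, here is a small plain-Python sketch. The rows are `SEX` 0 entries from this dataset (the 2015 estimates for ages 98, 99, 100 and the 999 total); the constant name `ALL_AGES` is an assumption introduced for readability.

```python
# (SEX, AGE, 2015 estimate) rows for SEX 0 from this dataset
rows = [
    (0, 98, 61991),
    (0, 99, 43641),
    (0, 100, 76974),
    (0, 999, 321418820),
]

ALL_AGES = 999  # code for "total of all ages", not a real age

# A naive comparison treats the code as if it were an age
oldest_naive = max(age for _, age, _ in rows)

# Dropping the total row first gives a meaningful answer
real_ages = [age for _, age, _ in rows if age != ALL_AGES]
oldest = max(real_ages)
```

The same idea applies to any filter on `AGE`: a condition such as "above 97" will also match the 999 row unless it is excluded.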

We want to compare two population estimates: 2010 and 2015. The table `partial` below keeps only the columns we need for that.

In [8]:
partial = full_census_table.select('SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2015')
partial

SEX,AGE,POPESTIMATE2010,POPESTIMATE2015
0,0,3951330,3978038
0,1,3957888,3968564
0,2,4090862,3966583
0,3,4111920,3974061
0,4,4077551,4020035
0,5,4064653,4018158
0,6,4073013,4019207
0,7,4043046,4148360
0,8,4025604,4167887
0,9,4125415,4133564


Next, we relabel `POPESTIMATE2010` and `POPESTIMATE2015` to `2010` and `2015`, and store the result in the table `us_pop`.

In [9]:
us_pop = partial.relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2015', '2015')
us_pop

SEX,AGE,2010,2015
0,0,3951330,3978038
0,1,3957888,3968564
0,2,4090862,3966583
0,3,4111920,3974061
0,4,4077551,4020035
0,5,4064653,4018158
0,6,4073013,4019207
0,7,4043046,4148360
0,8,4025604,4167887
0,9,4125415,4133564


Below, we calculate the difference between the 2015 and 2010 estimates.

In [10]:
us_pop.column('2015') - us_pop.column('2010')

array([   26708,    10676,  -124279,  -137859,   -57516,   -46495,
         -53806,   105314,   142283,     8149,   -65773,    14817,
         -12258,   -35360,    39772,    18740,  -128956,  -182081,
        -273010,  -308827,  -205077,    68834,   242467,   435038,
         493743,   440136,   383610,   202740,   117128,   172853,
         112970,   235717,   376012,   408173,   472649,   431069,
         278576,   131637,   -93087,  -453601,  -397641,  -298250,
        -158454,     6864,   156657,   -49214,  -369143,  -461788,
        -456974,  -446546,  -245943,   -19310,    -6240,    24091,
         228080,   294415,   305984,   424727,   518075,   469416,
         509071,   434492,   306876,     -774,   830101,   771518,
         722799,   610480,  1076541,   364917,   429913,   467584,
         585785,   395748,   267716,   207945,   240361,   189912,
          98631,    67159,    31471,   -11559,    -9403,     1122,
          28568,    42010,    24532,    53443,    53098,    50

However, it is difficult to read the differences in array form. To make them easier to inspect, we can add the array to the table as a new column.

In [11]:
change = us_pop.column('2015') - us_pop.column('2010')
census = us_pop.with_columns(
    'Change', change
)
census

SEX,AGE,2010,2015,Change
0,0,3951330,3978038,26708
0,1,3957888,3968564,10676
0,2,4090862,3966583,-124279
0,3,4111920,3974061,-137859
0,4,4077551,4020035,-57516
0,5,4064653,4018158,-46495
0,6,4073013,4019207,-53806
0,7,4043046,4148360,105314
0,8,4025604,4167887,142283
0,9,4125415,4133564,8149


We are also interested in the change as a percentage of the 2010 estimate, so we add that to the table as well.

In [12]:
change = us_pop.column('2015') - us_pop.column('2010')
census = us_pop.with_columns(
    'Change', change,
    # Percent change with respect to estimate population in 2010
    'Percent Change', change / us_pop.column('2010'),
)
census.set_format('Percent Change', PercentFormatter) #Changes the format of the column "Percent Change"

SEX,AGE,2010,2015,Change,Percent Change
0,0,3951330,3978038,26708,0.68%
0,1,3957888,3968564,10676,0.27%
0,2,4090862,3966583,-124279,-3.04%
0,3,4111920,3974061,-137859,-3.35%
0,4,4077551,4020035,-57516,-1.41%
0,5,4064653,4018158,-46495,-1.14%
0,6,4073013,4019207,-53806,-1.32%
0,7,4043046,4148360,105314,2.60%
0,8,4025604,4167887,142283,3.53%
0,9,4125415,4133564,8149,0.20%
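
As a check on the arithmetic, the first three rows of the table above can be recomputed in plain Python. This is a sketch that does not use the `datascience` library; the `.2%` format string is an assumption chosen to match the two-decimal percentages displayed in the table.

```python
# (2010, 2015) estimates for ages 0, 1 and 2 (SEX 0), copied from the table above
estimates = [
    (3951330, 3978038),
    (3957888, 3968564),
    (4090862, 3966583),
]

changes = [after - before for before, after in estimates]

# Percent change is measured relative to the earlier (2010) value
percent_changes = [(after - before) / before for before, after in estimates]
formatted = [f"{p:.2%}" for p in percent_changes]
```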


If we want to find the age group with the greatest change, we can `sort` the table!

In [13]:
census.sort('Change', descending=True)

SEX,AGE,2010,2015,Change,Percent Change
0,999,309346863,321418820,12071957,3.90%
1,999,152088043,158229297,6141254,4.04%
2,999,157258820,163189523,5930703,3.77%
0,68,2359816,3436357,1076541,45.62%
0,64,2706055,3536156,830101,30.68%
0,65,2678525,3450043,771518,28.80%
0,66,2621335,3344134,722799,27.57%
0,67,2693707,3304187,610480,22.66%
0,72,1883820,2469605,585785,31.10%
2,68,1254117,1812428,558311,44.52%


Be careful! The rows with age 999 refer to the total population of all ages!
1. The row with `SEX` 0 and `AGE` 999 is the total population of both genders and all ages
2. The row with `SEX` 1 and `AGE` 999 is the total male population of all ages
3. The row with `SEX` 2 and `AGE` 999 is the total female population of all ages
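
The relationship between the three total rows can be checked directly, and the totals can be set aside before ranking single ages. The sketch below uses plain Python and a handful of rows copied from the sorted table above; it is an illustration, not part of the original notebook.

```python
# (SEX, AGE, 2010, 2015) rows copied from the sorted table above
rows = [
    (0, 999, 309346863, 321418820),
    (1, 999, 152088043, 158229297),
    (2, 999, 157258820, 163189523),
    (0, 68, 2359816, 3436357),
    (0, 64, 2706055, 3536156),
    (2, 68, 1254117, 1812428),
]

# Collect the AGE 999 totals, keyed by SEX code
totals = {sex: (y2010, y2015) for sex, age, y2010, y2015 in rows if age == 999}

# The SEX 0 total should equal the SEX 1 total plus the SEX 2 total in both years
sums_match = all(totals[0][i] == totals[1][i] + totals[2][i] for i in range(2))

# Ignore the AGE 999 rows before looking for the largest change
single_ages = [row for row in rows if row[1] != 999]
largest = max(single_ages, key=lambda row: row[3] - row[2])
```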

Recall from Lecture 3 - Expressions that the annual growth rate `g` under **exponential growth** can be calculated as
\begin{align}
g = \left(\frac{after}{before}\right)^{\frac{1}{t}} - 1
\end{align}

where `t` is the number of years between the two measurements.

What is the annual growth rate of the total population if we assume exponential growth?

In [14]:
after = census.sort('Change', descending=True).column('2015').item(0)
before = census.sort('Change', descending=True).column('2010').item(0)
g = (after/before) ** (1/5) - 1
g

0.007685750230353783

This means that the total population grew by about 0.77% per year.
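
One way to sanity-check the result is to apply the growth rate for five years and confirm that it recovers the 2015 total. The sketch below uses the two totals from the table above as plain numbers; the name `recovered` is introduced only for this check.

```python
before = 309346863  # total population estimate, 2010
after = 321418820   # total population estimate, 2015
t = 5               # years between the two estimates

# Annual growth rate under exponential growth
g = (after / before) ** (1 / t) - 1

# Growing `before` at rate g for t years should give back `after`
recovered = before * (1 + g) ** t
```

Note that `g` is smaller than the overall 3.90% change divided by 5, because growth compounds from year to year.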

Another way to do the computation above is to use the `row` method instead of the `column` method. The `row` method returns a `Row` object, which can be treated much like an `array` but is not exactly an `array`.

In [15]:
first_row = census.sort('Change', descending = True).row(0)
first_row

Row(SEX=0, AGE=999, 2010=309346863, 2015=321418820, Change=12071957, Percent Change=0.039024016222204264)

In [16]:
after = first_row.item('2015') #Or we can use item(3)
after

321418820

In [17]:
before = first_row.item('2010')
g = (after/before) ** (1/5) - 1
g

0.007685750230353783
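
The two ways of reading a value from a row (by position or by label) can be imitated with a tuple and a list of labels. This is a plain-Python analogy, not the `Row` type from the `datascience` library; the labels and values are copied from the `Row` printed above.

```python
# Labels and values copied from the Row printed above
labels = ['SEX', 'AGE', '2010', '2015', 'Change', 'Percent Change']
values = (0, 999, 309346863, 321418820, 12071957, 0.039024016222204264)

# Lookup by position, like first_row.item(3)
by_index = values[3]

# Lookup by label, like first_row.item('2015')
by_label = values[labels.index('2015')]

# The stored change should equal the 2015 value minus the 2010 value
change_is_consistent = (
    values[labels.index('Change')] == by_label - values[labels.index('2010')]
)
```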

If we look at the table again with the sorted change,

In [18]:
census.sort('Change', descending=True)

SEX,AGE,2010,2015,Change,Percent Change
0,999,309346863,321418820,12071957,3.90%
1,999,152088043,158229297,6141254,4.04%
2,999,157258820,163189523,5930703,3.77%
0,68,2359816,3436357,1076541,45.62%
0,64,2706055,3536156,830101,30.68%
0,65,2678525,3450043,771518,28.80%
0,66,2621335,3344134,722799,27.57%
0,67,2693707,3304187,610480,22.66%
0,72,1883820,2469605,585785,31.10%
2,68,1254117,1812428,558311,44.52%


It seems that, apart from the total rows, the largest change occurred among 68-year-olds! Why is that?

In [19]:
2010 - 68

1942

If somebody was 68 years old in 2010, that person was born around 1942, the year after the United States entered World War II following the attack on Pearl Harbor in December 1941. During the war, not many babies were born.

In [20]:
2015 - 68

1947

And if somebody was 68 years old in 2015, that person was born around 1947. World War II ended in 1945, shortly after the atomic bombing of Nagasaki, and soldiers came home to the U.S. In the years that followed, the number of babies born skyrocketed! This generation is called the baby boomers. World War II had a huge effect on the U.S. population, since births swung from a year with very few babies born to a year with very many.

Now we'll look at the population older than 97.

**Note**: Be careful with the table method `show()`, as it displays all rows in a table. If we have a table with millions of rows and call `show()` without any argument, Python will try to display every one of them!

In [21]:
us_pop.where('AGE', are.above(97)).show()

SEX,AGE,2010,2015
0,98,47037,61991
0,99,32178,43641
0,100,54410,76974
0,999,309346863,321418820
1,98,9505,14719
1,99,6104,9577
1,100,9352,15088
1,999,152088043,158229297
2,98,37532,47272
2,99,26074,34064


If we look at the first 3 rows from above,

In [22]:
us_pop.where('AGE', are.above(97)).show(3)

SEX,AGE,2010,2015
0,98,47037,61991
0,99,32178,43641
0,100,54410,76974


There are more people aged 98 than 99, which makes sense because some people pass away before reaching 99. Strangely, there are many more people aged 100 than aged 98 or 99! What happened?

It turns out that `AGE` 100 counts everyone aged 100 and above (recall the data specification in the [PDF](https://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.pdf)).

This time, we will analyze the estimated population of all ages in 2015.

In [23]:
us_pop_2015 = us_pop.drop('2010')
all_ages = us_pop_2015.where('AGE', 999)
all_ages

SEX,AGE,2015
0,999,321418820
1,999,158229297
2,999,163189523


Again, sometimes we are interested in proportions rather than raw counts. A proportion can be computed by dividing the population of one sex by the population of both sexes combined.

In [24]:
total = all_ages.column('2015').item(0)
all_ages.with_column(
    'Proportion', all_ages.column('2015') / total,
).set_format('Proportion', PercentFormatter)

SEX,AGE,2015,Proportion
0,999,321418820,100.00%
1,999,158229297,49.23%
2,999,163189523,50.77%


It appears that in 2015 there were estimated to be more females than males! However, if we look at the population of infants (`AGE` 0),

In [25]:
infants = us_pop_2015.where('AGE', 0)
infants

SEX,AGE,2015
0,0,3978038
1,0,2035134
2,0,1942904


In [26]:
infants.with_column(
    'Proportion', infants.column('2015') / infants.column('2015').item(0),
).set_format('Proportion', PercentFormatter)

SEX,AGE,2015,Proportion
0,0,3978038,100.00%
1,0,2035134,51.16%
2,0,1942904,48.84%


It seems that more male babies than female babies were born! Then why are there more females in the total population? (See next lecture)
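
The two proportion calculations above can be placed side by side in plain Python. This is a sketch: `proportions` is a hypothetical helper introduced here, and the counts are the 2015 estimates copied from the two tables above.

```python
# 2015 estimates copied from the tables above: (male, female, both sexes)
all_ages = (158229297, 163189523, 321418820)
infants = (2035134, 1942904, 3978038)

def proportions(male, female, total):
    """Return the male and female shares of the total as percentage strings."""
    return f"{male / total:.2%}", f"{female / total:.2%}"

all_ages_share = proportions(*all_ages)  # (male share, female share) for all ages
infant_share = proportions(*infants)     # (male share, female share) for AGE 0
```

The male share is above one half among infants but below one half across all ages, which is the puzzle left for the next lecture.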