# Tables Review

The table **nba** has columns **PLAYER**, **POSITION**, and **2015-2016 SALARY** (in millions of dollars).

Create an array containing the names of all point guards who make more than $15M/year.

In [7]:
from datascience import *
nba_salary = Table.read_table('nba_salaries.csv')  # This is the original table
# Below: select PLAYER, POSITION, and '2015-2016 SALARY' from the original table
# and assign the result to the variable 'nba'
nba = nba_salary.select('PLAYER','POSITION', '2015-2016 SALARY') 
salary = "2015-2016 SALARY"
position = "POSITION"
nba.where(salary, are.above(15)).where(position, are.equal_to("PG")).column("PLAYER")

array(['Derrick Rose', 'Kyrie Irving', 'Chris Paul', 'Russell Westbrook',
       'John Wall'], dtype='<U24')

In [8]:
# After evaluating these two expressions in order, what is the result of the second one?
nba.with_row(['Sam Lau', 'Mascot', 100])
# Sam Lau won't be in the table: with_row returns a new table, and we didn't assign that result to any variable!
nba.where('PLAYER', are.containing('Lau'))

PLAYER,POSITION,2015-2016 SALARY
Joffrey Lauvergne,C,1.70972
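
The same behavior can be illustrated without the `datascience` library. The sketch below is only an analogy in plain Python, with a tuple of rows standing in for the table (the names `rows`, `extended`, `lau_before`, and `lau_after` are hypothetical): building a larger collection returns a new object, and the original is unchanged unless the result is assigned to a name.

```python
# Analogy only: a tuple of rows stands in for the nba table.
rows = (("Joffrey Lauvergne", "C", 1.70972),)

# Like with_row, concatenation builds a new collection; here the result is discarded.
rows + (("Sam Lau", "Mascot", 100),)
lau_before = [r for r in rows if "Lau" in r[0]]

# Assigning the result to a name is what keeps the new row.
extended = rows + (("Sam Lau", "Mascot", 100),)
lau_after = [r for r in extended if "Lau" in r[0]]

print(len(lau_before), len(lau_after))
```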


# Census Data

## The Decennial Census

* Every 10 years, the Census Bureau counts how many people there are in the U.S.
* In between censuses, the Bureau estimates how many people there are each year
* Article I, Section 2 of the Constitution:
    * "Representatives and direct Taxes shall be apportioned among the several States ... according to their respective Numbers..."

## Analyzing Census Data
Analyzing census data leads to the discovery of interesting features and trends in the population.

In [9]:
census = Table.read_table('census.csv')
census

SEX,AGE,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
0,0,3944153,3944160,3951330,3963087,3926540,3931141,3949775,3978038
0,1,3978070,3978090,3957888,3966551,3977939,3942872,3949776,3968564
0,2,4096929,4096939,4090862,3971565,3980095,3992720,3959664,3966583
0,3,4119040,4119051,4111920,4102470,3983157,3992734,4007079,3974061
0,4,4063170,4063186,4077551,4122294,4112849,3994449,4005716,4020035
0,5,4056858,4056872,4064653,4087709,4132242,4123626,4006900,4018158
0,6,4066381,4066412,4073013,4074993,4097605,4142916,4135930,4019207
0,7,4030579,4030594,4043046,4083225,4084913,4108349,4155326,4148360
0,8,4046486,4046497,4025604,4053203,4093177,4095711,4120903,4167887
0,9,4148353,4148369,4125415,4035710,4063152,4104072,4108349,4133564


## Census Table Description

* Values have column-dependent interpretations
    * The **SEX** column: 1 = Male, 2 = Female
    * The **POPESTIMATE2010** column: 7/1/2010 estimate
* In this table, some rows are sums of other rows
    * The **SEX** column: 0 = Total (Male + Female)
    * The **AGE** column: 999 is the **total** of all ages
* Numeric codes are often used for storage efficiency
* Values in a column have the same type, but are not necessarily comparable (AGE 12 vs. AGE 999)
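
As a minimal sketch of why the summary codes matter, the plain-Python example below uses a handful of (SEX, AGE, 2015 estimate) rows copied from the output tables in this notebook. The names `rows`, `detail`, `total`, `male`, and `female` are hypothetical. It drops the aggregate rows before any per-age comparison and checks that the SEX = 0 total is the sum of the SEX = 1 and SEX = 2 rows.

```python
# (SEX, AGE, 2015 estimate) rows copied from the output tables in this notebook.
rows = [
    (0, 999, 321418820),
    (1, 999, 158229297),
    (2, 999, 163189523),
    (0, 68, 3436357),
    (2, 68, 1812428),
]

# Keep only rows that describe one sex and one real age (no summary codes).
detail = [r for r in rows if r[0] != 0 and r[1] != 999]

# The SEX = 0 row at AGE 999 should equal the male and female totals combined.
total = next(pop for sex, age, pop in rows if sex == 0 and age == 999)
male = next(pop for sex, age, pop in rows if sex == 1 and age == 999)
female = next(pop for sex, age, pop in rows if sex == 2 and age == 999)

print(detail)
print(male + female == total)
```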


In [10]:
# Make a new table containing only the 2010 and 2015 population estimates, and relabel those columns by year
shortened = census.select('SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2015').relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2015', '2015')
# Calculate the change in population from 2010 to 2015
change = shortened.column('2015') - shortened.column('2010') # 'change' now contains an array of differences
# Now add 'Change' and 'Percent Change' to the new table
new = shortened.with_columns(
    'Change', change,
    'Percent Change', change/shortened.column('2010')
)
new

SEX,AGE,2010,2015,Change,Percent Change
0,0,3951330,3978038,26708,0.00675924
0,1,3957888,3968564,10676,0.0026974
0,2,4090862,3966583,-124279,-0.0303797
0,3,4111920,3974061,-137859,-0.0335267
0,4,4077551,4020035,-57516,-0.0141055
0,5,4064653,4018158,-46495,-0.0114389
0,6,4073013,4019207,-53806,-0.0132104
0,7,4043046,4148360,105314,0.0260482
0,8,4025604,4167887,142283,0.0353445
0,9,4125415,4133564,8149,0.00197532
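
To make the arithmetic behind the two new columns concrete, here is a small plain-Python sketch that recomputes the first three rows of the table above element by element. It uses lists instead of the arrays returned by `.column`, and the names `pop_2010`, `pop_2015`, and `percent_change` are hypothetical.

```python
# 2010 and 2015 estimates for the first three rows of the table above.
pop_2010 = [3951330, 3957888, 4090862]
pop_2015 = [3978038, 3968564, 3966583]

# Element-wise difference, mirroring the array subtraction in the cell above.
change = [after - before for before, after in zip(pop_2010, pop_2015)]

# Change relative to the 2010 value; this is a proportion, not yet a percentage.
percent_change = [c / before for c, before in zip(change, pop_2010)]

print(change)
print(percent_change)
```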


Notice that the Percent Change column in the table above holds proportions rather than percentages. We can change the column's display format to percentages so that it reads better!

In [25]:
formatted = new.set_format('Percent Change', PercentFormatter).sort('Change', descending=True)
formatted

SEX,AGE,2010,2015,Change,Percent Change
0,999,309346863,321418820,12071957,3.90%
1,999,152088043,158229297,6141254,4.04%
2,999,157258820,163189523,5930703,3.77%
0,68,2359816,3436357,1076541,45.62%
0,64,2706055,3536156,830101,30.68%
0,65,2678525,3450043,771518,28.80%
0,66,2621335,3344134,722799,27.57%
0,67,2693707,3304187,610480,22.66%
0,72,1883820,2469605,585785,31.10%
2,68,1254117,1812428,558311,44.52%
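
Formatting of this kind changes only how a value is displayed, not the number that is stored. As a rough plain-Python illustration (not the `PercentFormatter` implementation itself), the sketch below recomputes the first row's proportion and renders it with a percent format specifier. The names `percent_change` and `shown` are hypothetical.

```python
# Proportion for the first row above: Change divided by the 2010 population.
percent_change = 12071957 / 309346863

# Render as a percentage with two decimals; the stored float is untouched.
shown = f"{percent_change:.2%}"

print(percent_change)
print(shown)
```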


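If we assume **exponential growth**, the annual growth factor over an interval of years is

$$ \text{Growth} = \left(\frac{\text{After}}{\text{Before}}\right)^{\frac{1}{\text{Interval}}} $$

Subtracting 1 from this factor gives the annual growth rate.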

To do this, the **.row** method helps! It's similar to **.column**, but it returns all the values in a given row. Be aware that this method returns a Row object rather than an array (although you can access its items much as you would an array's)!

In [21]:
first_row = formatted.row(0)
first_row

Row(SEX=0, AGE=999, 2010=309346863, 2015=321418820, Change=12071957, Percent Change=0.039024016222204264)

In [22]:
After = formatted.row(0).item(3)
Before = formatted.row(0).item(2)
Interval = 2015 - 2010
Growth = (After / Before) ** (1/Interval)
Growth

1.0076857502303538
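
The value above is a yearly multiplier, so the growth rate is the part above 1. The plain-Python sketch below repeats the calculation with the totals from the first row and checks that compounding the factor over the five-year interval recovers the 2015 population. The names `growth_factor`, `growth_rate`, and `projected` are hypothetical.

```python
# Total population (SEX 0, AGE 999) from the first row of the formatted table.
before = 309346863   # 2010
after = 321418820    # 2015
interval = 2015 - 2010

# Annual growth factor under the exponential growth assumption.
growth_factor = (after / before) ** (1 / interval)

# Annual growth rate: the factor minus 1 (a bit under 0.8% per year).
growth_rate = growth_factor - 1

# Sanity check: compounding the factor over the interval should give back 'after'.
projected = before * growth_factor ** interval

print(growth_factor, growth_rate, projected)
```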

Notice that, aside from the AGE 999 totals, the greatest absolute changes were in the 64-68 age group. Why is that?
We can explore this question by examining the years in which these people were born.
Those who were in the 64-68 age group...
* In 2010 were born in roughly 1942-1946. This was during World War II: the attack on Pearl Harbor occurred in late 1941, and U.S. forces were engaged in a war that ended in 1945
* In 2015 were born in roughly 1947-1951, during the post-WW2 baby boom in the U.S.
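
The birth years can be approximated as the estimate year minus the age. The short plain-Python sketch below does this for ages 64 through 68 in 2010 and in 2015. The result is approximate, because the estimates are dated July 1 and a person's birth year depends on whether their birthday falls before that date. The names `ages` and `birth_years` are hypothetical.

```python
# Ages 64 through 68, the group with the largest absolute changes.
ages = range(64, 69)

# Approximate (earliest, latest) birth year for the group in each estimate year.
birth_years = {
    year: (year - max(ages), year - min(ages))
    for year in (2010, 2015)
}

print(birth_years)
```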

Now we'll explore the breakdown of the 2015 population by sex.

In [28]:
gender_2015 = formatted.drop('2010', 'Change', 'Percent Change').where('AGE', 999)
gender_2015

SEX,AGE,2015
0,999,321418820
1,999,158229297
2,999,163189523


In [34]:
# Now we calculate each group's proportion of the total and display it as a percentage
gender_2015.with_column(
    'Proportion',
    gender_2015.column('2015') / gender_2015.column('2015').item(0)
).set_format('Proportion', PercentFormatter)

SEX,AGE,2015,Proportion
0,999,321418820,100.00%
1,999,158229297,49.23%
2,999,163189523,50.77%
