# County Demographics
---
- Sourced from colorado census data, and population data

In [1]:
%run workspace.py

INDEX = ['year', 'county']

## Population

> This data will supplement our census data in the next step of data prep. We're using the Population dataset because it's more accurate (census is just estimates), and it lets us create the age grouping ourselves. There are nearly 400,000 rows, because they give us population by year, county, and EACH individual age. In our case, we want to create an age grouping that separates students in school, and adults. So we chose >= 19, and < 19. It also has a 60 year timeframe. So number of rows = 60 years * 64 counties * 90 years of age.

In [2]:
df_raw = read_raw('''
SELECT
    year,
    UPPER(county) AS county,
    age,
    malePopulation AS male,
    femalePopulation AS female,
    totalPopulation AS total
FROM county_population;
''')
head(df_raw)

6 cols x 381504 rows


Unnamed: 0,year,county,age,male,female,total
0,1990,ADAMS,0,2354,2404,4758
1,1990,ADAMS,1,2345,2375,4720
2,1990,ADAMS,2,2413,2219,4632


### Age groups (< 19, >= 19)

In [3]:
df = df_raw.coerce_type(float, exclude=['year'])
df['age_range'] = "over18"
df.loc[df.age <= 18, 'age_range'] = 'under19'
df = df.drop_cols('age')
head(df)

df = (df
    .groupby(INDEX + ['age_range'])
    .sum()
    .reset_index()
)
df_grouped = df
head(df_grouped)

6 cols x 381504 rows


Unnamed: 0,year,county,male,female,total,age_range
0,1990,ADAMS,2354.0,2404.0,4758.0,under19
1,1990,ADAMS,2345.0,2375.0,4720.0,under19
2,1990,ADAMS,2413.0,2219.0,4632.0,under19


6 cols x 7808 rows


Unnamed: 0,year,county,age_range,male,female,total
0,1990,ADAMS,over18,90383.0,94282.0,184665.0
1,1990,ADAMS,under19,41519.0,39525.0,81044.0
2,1990,ALAMOSA,over18,4488.0,4823.0,9311.0


### Notice the `age_range` column. We should pivot those values out to their own columns, and mix with our existing columns
- First, pivot age_range into the male, female, and total columns
- We're left with a multilevel column index, so we drop a level and rename everything by hand.
- Lastly, restore the total, male, and female columns since they got split in half when pivoting.

In [4]:
df = (df_grouped
    .pivot(
        index=INDEX,
        columns='age_range',
        values=['male', 'female', 'total']
    )
    .reset_index()
    .flatten_multi_level_cols()
    .set_columns(INDEX + ['over18', 'under19', 'under19_male', 'under19_female', 'over18_male', 'over18_female'])
)

df.insert_at(2, 'female', df.under19_female + df.over18_female)
df.insert_at(2, 'male', df.under19_male + df.over18_male)
df.insert_at(2, 'total', df.under19 + df.over18)

pop_raw = df
head(pop_raw)

11 cols x 3904 rows


Unnamed: 0,year,county,total,male,female,over18,under19,under19_male,under19_female,over18_male,over18_female
0,1990,ADAMS,131902.0,278947.0,120569.0,90383.0,41519.0,94282.0,39525.0,184665.0,81044.0
1,1990,ALAMOSA,6677.0,14134.0,6423.0,4488.0,2189.0,4823.0,2117.0,9311.0,4306.0
2,1990,ARAPAHOE,191722.0,428121.0,166735.0,134481.0,57241.0,146820.0,54747.0,281301.0,111988.0


In [5]:
write_main(pop_raw, 'county_population')

3904

---
---
---

## Census Field Descriptions
---
- To supplement the "Census Counties ..." datasets, they've provided us a table with descriptions of each column name, for each historical standard of the census. Fortunately, the 2019 and 2012 census data (that's what we're using) uses the same standard: `acs_standard`
- This script does the following:
  - Filters the source dataframe to only include `acs_standard` column descriptions
  - Selects only necessary columns (column name, description)
  - Renames some values in column name, and removes some column name values we'll never use (geonum, geojson)

In [6]:
desc = read_raw('''
SELECT
    apifieldname AS field_name,
    description
FROM census_counties_field_desc
WHERE type = 'acs_standard'
AND field_name NOT IN ('geonum', 'geojson')
''')
desc.loc[desc.field_name == 'geoname', 'field_name'] = 'county'

write_main(desc, 'census_counties_field_desc')

head(desc)

2 cols x 155 rows


Unnamed: 0,field_name,description
0,county,Geographic Area common name
1,pop,Population Estimate for the given time range
2,hispanic,Estimate for the Hispanic Population


---
---
---

## Census data

In [7]:
# Each census year comes in a separate dataset
dem = (
    pd.concat([
            (read_raw(f'select * from census_counties_{year}')
                .drop_cols(['pop', 'geonum', 'the_geom'])
                .assign(year=year)
                .rename_col('civ_ni_','civ_ni_p')
            )
        for year in range(2012, 2020)
    ])
    .copy() # avoid fragmentation caused by assign()
    .rename_col('geoname','county')
    .move_col('year', 0)
    .coerce_type(float, exclude='year')
)

dem.county = (
    dem.county
    .str.replace(" County, Colorado", "")
    .str.upper()
)
head(dem)

155 cols x 512 rows


Unnamed: 0,year,county,hispanic,white_nh,black_nh,ntvam_nh,asian_nh,hawpi_nh,other_nh,twoplus_nh,...,civ_ni_pop,disabled,pop16_pls,laborforce,civ_lf,emp,unemp,armedfrcs,not_lf,civ_ni_p
0,2012,ARAPAHOE,105174.0,364766.0,55629.0,2211.0,28067.0,1166.0,1267.0,16077.0,...,568663.0,49870.0,444215.0,320199.0,318041.0,292089.0,25952.0,2158.0,124016.0,568663.0
1,2012,MINERAL,15.0,671.0,9.0,5.0,1.0,0.0,0.0,1.0,...,702.0,129.0,681.0,391.0,391.0,370.0,21.0,0.0,290.0,702.0
2,2012,MONTROSE,8037.0,31799.0,186.0,74.0,227.0,49.0,33.0,589.0,...,40552.0,5649.0,32334.0,20137.0,20124.0,18110.0,2014.0,13.0,12197.0,40552.0


## Combine population data and census data
- Looking in the original demographic data, most population groups are present: gender, age, etc. So why not use those?
- A couple reasons.
  - The population dataset is likely more accurate, claiming to provide "actual" numbers, whereas the census data provides "estimates"
  - The population dataset is more precise, with age groups of each individual age number, allowing us to make our own aggregated bins (adult, minor). The census data has age groups defined already, but in increments of 5, so the middle group is "15 to 19", but we need 18 and under!
  - The population dataset offers sub-aggregations: we have `minor_female` and `minor_male`, for instance, whereas the census data only offers age populations and gender populations separately
- So instead, we will use population dataset first, and add in additional groups from census data
---

#### Select desired columns from census data

In [8]:
df = dem.copy()[INDEX + [
    'med_age',
    'households', 'avghhsize',
    'civ_lf', 'emp', 'unemp',
    'hispanic', 'white_nh', 'black_nh', 'asian_nh', 'ntvam_nh', 'hawpi_nh', 'other_nh', 'twoplus_nh',
    'pop25plus', 'hsgrad_sc',
    'med_hh_inc', 'per_cap_in',
    'citz_birth', 'citz_nat', 'born_in_co',
    'pop_3pl', 'enrolled', 'undergrad',
    'gr_1_4', 'gr_5_8', 'gr_9_12',
    'med_hm_val', 'med_yr_blt',
    'housing_un', 'occ_hu',
    'own_occ_hu', 'v_l_50k', 'v50k_100k', 'v100k_150k', 'v150k_200k', 'v200k_250k', 'v250k_300k',
    'v300k_400k', 'v400k_500k', 'v500k_750k', 'v750k_1m', 'v_1m_plus',
    'b2000_2009', 'b1990_1999', 'b1980_1989', 'b1970_1979',
    'b1960_1969', 'b1950_1959', 'b1940_1949', 'b1939_e',
    'ps_uni', 'ps_below',
    'tot_l18', 'pov_l18',
]]

#### Group bins together

In [9]:
# Create new variable for total citizens. Place it next to citz_birth
df = (df
    .insert_at('citz_birth', 'citz', df.citz_birth + df.citz_nat)
    .drop_cols('citz_nat')
    .combine_cols(items={
        'race_other': ['ntvam_nh', 'hawpi_nh', 'other_nh', 'twoplus_nh'],
        'b1949_e': ['b1939_e', 'b1940_1949'],
        'v50k_150k':  ['v50k_100k', 'v100k_150k'],
        'v150k_250k': ['v150k_200k', 'v200k_250k'],
        'v250k_400k': ['v250k_300k', 'v300k_400k'],
        'v400k_750k': ['v400k_500k', 'v500k_750k'],
        'v750k_plus': ['v750k_1m', 'v_1m_plus'],
    })
)
df

Unnamed: 0,year,county,med_age,households,avghhsize,civ_lf,emp,unemp,hispanic,white_nh,...,b1990_1999,b1980_1989,b1970_1979,b1960_1969,b1950_1959,b1949_e,ps_uni,ps_below,tot_l18,pov_l18
0,2012,ARAPAHOE,35.7,223747.0,2.55,318041.0,292089.0,25952.0,105174.0,364766.0,...,33989.0,56011.0,62253.0,22258.0,16519.0,7165.0,568999.0,66945.0,144576.0,23054.0
1,2012,MINERAL,60.3,363.0,1.83,391.0,370.0,21.0,15.0,671.0,...,232.0,239.0,203.0,100.0,75.0,240.0,702.0,47.0,26.0,0.0
2,2012,MONTROSE,42.6,16732.0,2.41,20124.0,18110.0,2014.0,8037.0,31799.0,...,3750.0,2106.0,3581.0,1298.0,920.0,2333.0,40368.0,5565.0,9788.0,1927.0
3,2012,PARK,47.0,6997.0,2.29,9583.0,8796.0,787.0,777.0,14818.0,...,3567.0,2374.0,2952.0,1051.0,693.0,939.0,16049.0,1355.0,3049.0,276.0
4,2012,MORGAN,36.0,10489.0,2.62,13786.0,12758.0,1028.0,9557.0,17399.0,...,1195.0,984.0,2282.0,1078.0,1858.0,2703.0,27416.0,4002.0,7670.0,1454.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,2019,CONEJOS,38.8,3183.0,2.53,3383.0,3029.0,354.0,4263.0,3622.0,...,694.0,565.0,665.0,342.0,265.0,1358.0,8089.0,1756.0,2164.0,636.0
60,2019,ADAMS,33.8,166450.0,3.00,272810.0,261893.0,10917.0,201784.0,252170.0,...,28459.0,22069.0,29342.0,17026.0,19373.0,6023.0,499315.0,54159.0,134212.0,19943.0
61,2019,EAGLE,37.0,18171.0,3.00,34775.0,34128.0,647.0,16179.0,36748.0,...,9377.0,7555.0,6022.0,907.0,318.0,934.0,54401.0,4354.0,11805.0,1184.0
62,2019,MOFFAT,36.6,5366.0,2.42,6445.0,6163.0,282.0,2044.0,10543.0,...,804.0,751.0,2168.0,567.0,472.0,761.0,13003.0,2206.0,3361.0,680.0


#### Create nominal variables for housing price and housing age
- First, create a categorical variable whose values are the COLUMN NAME of the bin with the max value. For instance, if a given county has more houses in the `v50k_100k` range than any other range, the value at that row in the new column will be "v50k_100k"
- Next, create a nominal column from that categorical column, ordered so that a lower number means less desirable. So for prices, "v_l_50k" -> 1, and for year built, "b1939_e" -> 1

In [10]:
blt_ascending = ['b1949_e','b1950_1959','b1960_1969','b1970_1979','b1980_1989','b1990_1999','b2000_2009']
prices_ascending = ['v_l_50k', 'v50k_150k', 'v150k_250k', 'v250k_400k', 'v400k_750k', 'v750k_plus']
df = (df
    .add_binmax('blt_freq_yr', blt_ascending)
    .add_ordinal('blt_freq_yr', blt_ascending)
    .add_binmax('hu_freq_val', prices_ascending)
    .add_ordinal('hu_freq_val', prices_ascending)
)

---

#### Rename everything, with a naming system that let's us easily select sub-groups of columns with a simple string match

In [11]:
# If you're wondering why we're doing all this renaming, look at the
# beginning of each new name. Notice a pattern?
pop = pop_raw.rename(columns={
    'total':            'pop',

    'male':             'gend_m',
    'female':           'gend_f',

    'over18':           'age_over18',
    'under19':          'age_undr19',

    'over18_male':      'gend_m_age_over18',
    'over18_female':    'gend_f_age_over18',
    'under19_male':     'gend_m_age_undr19',
    'under19_female':   'gend_f_age_undr19',
})
df = df.rename(columns={
    'med_age':      'age_median',

    'per_cap_in':   'inc_per_cap',
    'med_hh_inc':   'inc_hh_median',

    'households':   'hh',
    'avghhsize':    'hh_size_avg',

    'pop25plus':    'hsgrad_pool',
    'hsgrad_sc':    'hsgrad_graduated',

    'born_in_co':   'citz_co',
    'citz_birth':   'citz_birth',

    'emp':          'civ_lf_employed',

    'hispanic':     'race_hispanic',
    'white_nh':     'race_white',
    'black_nh':     'race_black',
    'asian_nh':     'race_asian',

    'ps_uni':       'ps_known',
    'ps_below':     'ps_below',
    'tot_l18':      'ps_undr18_known',
    'pov_l18':      'ps_undr18_below',

    'pop_3pl':      'stud_enroll_pool',
    'enrolled':     'stud_enrolled',
    'undergrad':    'stud_undergrad',
    'gr_1_4':       'stud_1_4',
    'gr_5_8':       'stud_5_8',
    'gr_9_12':      'stud_9_12',

    'housing_un':   'hu',
    'occ_hu':       'hu_occ',

    'blt_freq_yr':  'hu_blt_freq_yr',
    'blt_freq_yr_ord':'hu_blt_freq_yr_ord',
    'b1949_e':      'hu_blt_lt_1950',
    'b1950_1959':   'hu_blt_1950_1959',
    'b1960_1969':   'hu_blt_1960_1969',
    'b1970_1979':   'hu_blt_1970_1979',
    'b1980_1989':   'hu_blt_1980_1989',
    'b1990_1999':   'hu_blt_1990_1999',
    'b2000_2009':   'hu_blt_2000_plus',

    'own_occ_hu':   'hu_oo',
    'hu_freq_val':  'hu_oo_freq_val',
    'hu_freq_val_ord':'hu_oo_freq_val_ord',
    'v_l_50k':      'hu_oo_lt_50',
    'v50k_150k':    'hu_oo_50_150',
    'v150k_250k':   'hu_oo_150_250',
    'v250k_400k':   'hu_oo_250_400',
    'v400k_750k':   'hu_oo_400_750',
    'v750k_plus':   'hu_oo_750_plus',
})

## Merge population and census data

In [12]:
main = (pop
    .merge(df, on=INDEX)
    .move_col('age_median', 'age_over18')
)
head(main)

61 cols x 512 rows


Unnamed: 0,year,county,pop,gend_m,gend_f,age_median,age_over18,age_undr19,gend_m_age_undr19,gend_f_age_undr19,...,hu_blt_1970_1979,hu_blt_1960_1969,hu_blt_1950_1959,hu_blt_freq_yr_ord,hu_blt_freq_yr,hu_blt_lt_1950,ps_known,ps_below,ps_undr18_known,ps_undr18_below
0,2012,ADAMS,231571.0,487410.0,201960.0,32.4,162109.0,69462.0,162653.0,66249.0,...,30185.0,19615.0,20369.0,7,b2000_2009,6158.0,438171.0,62008.0,124375.0,25278.0
1,2012,ALAMOSA,7823.0,17115.0,6283.0,32.2,5622.0,2201.0,5748.0,2044.0,...,1405.0,654.0,591.0,1,b1949_e,1536.0,14622.0,3191.0,3817.0,758.0
2,2012,ARAPAHOE,292548.0,666719.0,233180.0,35.7,212207.0,80341.0,227254.0,76419.0,...,62253.0,22258.0,16519.0,4,b1970_1979,7165.0,568999.0,66945.0,144576.0,23054.0


## Calculations for groups
---

In [13]:
from grouped_df import GroupedDF
GroupedDF.default_index = INDEX
GroupedDF.set_groups(['age', 'gend', 'race', 'inc', 'hh', 'citz', 'hsgrad', 'civ_lf', 'ps', 'stud', 'hu', 'hu_blt', 'hu_oo'])

In [14]:
gd = GroupedDF(main, INDEX, custom={'hu': INDEX + ['hu', 'hu_occ']})
# gd.df
gd.display(5)

age: 


Unnamed: 0,year,county,age_median,age_over18,age_undr19
0,2012,ADAMS,32.4,162109.0,69462.0
1,2012,ALAMOSA,32.2,5622.0,2201.0
2,2012,ARAPAHOE,35.7,212207.0,80341.0
3,2012,ARCHULETA,47.5,4736.0,1297.0
4,2012,BACA,47.8,1414.0,422.0



gend: 


Unnamed: 0,year,county,gend_m,gend_f,gend_m_age_undr19,gend_f_age_undr19,gend_m_age_over18,gend_f_age_over18
0,2012,ADAMS,487410.0,201960.0,162653.0,66249.0,324757.0,135711.0
1,2012,ALAMOSA,17115.0,6283.0,5748.0,2044.0,11367.0,4239.0
2,2012,ARAPAHOE,666719.0,233180.0,227254.0,76419.0,439465.0,156761.0
3,2012,ARCHULETA,14349.0,3600.0,4810.0,1150.0,9539.0,2450.0
4,2012,BACA,4332.0,1239.0,1460.0,409.0,2872.0,830.0



race: 


Unnamed: 0,year,county,race_hispanic,race_white,race_black,race_asian,race_other
0,2012,ADAMS,167556.0,235991.0,12970.0,15304.0,11175.0
1,2012,ALAMOSA,7185.0,7767.0,110.0,59.0,629.0
2,2012,ARAPAHOE,105174.0,364766.0,55629.0,28067.0,20721.0
3,2012,ARCHULETA,2157.0,9493.0,9.0,117.0,333.0
4,2012,BACA,347.0,3311.0,16.0,30.0,79.0



inc: 


Unnamed: 0,year,county,inc_hh_median,inc_per_cap
0,2012,ADAMS,56633.0,24357.0
1,2012,ALAMOSA,38045.0,19657.0
2,2012,ARAPAHOE,60400.0,32845.0
3,2012,ARCHULETA,54007.0,29771.0
4,2012,BACA,39497.0,22436.0



hh: 


Unnamed: 0,year,county,hh,hh_size_avg
0,2012,ADAMS,151034.0,2.91
1,2012,ALAMOSA,5853.0,2.49
2,2012,ARAPAHOE,223747.0,2.55
3,2012,ARCHULETA,4536.0,2.64
4,2012,BACA,1675.0,2.18



citz: 


Unnamed: 0,year,county,citz,citz_birth,citz_co
0,2012,ADAMS,396172.0,376454.0,223907.0
1,2012,ALAMOSA,15122.0,14868.0,9542.0
2,2012,ARAPAHOE,519940.0,487576.0,223433.0
3,2012,ARCHULETA,11924.0,11729.0,3411.0
4,2012,BACA,3717.0,3654.0,1996.0



hsgrad: 


Unnamed: 0,year,county,hsgrad_pool,hsgrad_graduated
0,2012,ADAMS,275628.0,166731.0
1,2012,ALAMOSA,9424.0,5946.0
2,2012,ARAPAHOE,378792.0,199197.0
3,2012,ARCHULETA,8659.0,4882.0
4,2012,BACA,2769.0,1909.0



civ_lf: 


Unnamed: 0,year,county,civ_lf,civ_lf_employed
0,2012,ADAMS,236110.0,213794.0
1,2012,ALAMOSA,7171.0,6449.0
2,2012,ARAPAHOE,318041.0,292089.0
3,2012,ARCHULETA,6124.0,5444.0
4,2012,BACA,1876.0,1827.0



ps: 


Unnamed: 0,year,county,ps_known,ps_below,ps_undr18_known,ps_undr18_below
0,2012,ADAMS,438171.0,62008.0,124375.0,25278.0
1,2012,ALAMOSA,14622.0,3191.0,3817.0,758.0
2,2012,ARAPAHOE,568999.0,66945.0,144576.0,23054.0
3,2012,ARCHULETA,11989.0,1051.0,2386.0,359.0
4,2012,BACA,3649.0,530.0,824.0,139.0



stud: 


Unnamed: 0,year,county,stud_enroll_pool,stud_enrolled,stud_undergrad,stud_1_4,stud_5_8,stud_9_12
0,2012,ADAMS,420756.0,117499.0,19299.0,28761.0,26645.0,24342.0
1,2012,ALAMOSA,14903.0,5362.0,2285.0,736.0,801.0,890.0
2,2012,ARAPAHOE,549701.0,153854.0,29388.0,33703.0,30902.0,33425.0
3,2012,ARCHULETA,11866.0,2588.0,228.0,494.0,748.0,789.0
4,2012,BACA,3663.0,749.0,52.0,126.0,221.0,215.0



hu: 


Unnamed: 0,year,county,hu,hu_occ
0,2012,ADAMS,163245.0,151034.0
1,2012,ALAMOSA,6572.0,5853.0
2,2012,ARAPAHOE,238160.0,223747.0
3,2012,ARCHULETA,8742.0,4536.0
4,2012,BACA,2253.0,1675.0



hu_blt: 


Unnamed: 0,year,county,hu_blt_2000_plus,hu_blt_1990_1999,hu_blt_1980_1989,hu_blt_1970_1979,hu_blt_1960_1969,hu_blt_1950_1959,hu_blt_freq_yr_ord,hu_blt_freq_yr,hu_blt_lt_1950
0,2012,ADAMS,38682.0,27598.0,20368.0,30185.0,19615.0,20369.0,7,b2000_2009,6158.0
1,2012,ALAMOSA,650.0,866.0,862.0,1405.0,654.0,591.0,1,b1949_e,1536.0
2,2012,ARAPAHOE,39415.0,33989.0,56011.0,62253.0,22258.0,16519.0,4,b1970_1979,7165.0
3,2012,ARCHULETA,2204.0,2186.0,2054.0,1384.0,326.0,124.0,7,b2000_2009,415.0
4,2012,BACA,46.0,172.0,172.0,470.0,284.0,306.0,1,b1949_e,803.0



hu_oo: 


Unnamed: 0,year,county,hu_oo,hu_oo_freq_val_ord,hu_oo_freq_val,hu_oo_lt_50,hu_oo_50_150,hu_oo_150_250,hu_oo_250_400,hu_oo_400_750,hu_oo_750_plus
0,2012,ADAMS,100108.0,3,v150k_250k,8578.0,19838.0,47583.0,17779.0,5427.0,903.0
1,2012,ALAMOSA,3702.0,2,v50k_150k,435.0,1599.0,1077.0,397.0,177.0,17.0
2,2012,ARAPAHOE,143158.0,3,v150k_250k,4207.0,22174.0,55935.0,38213.0,16339.0,6290.0
3,2012,ARCHULETA,3532.0,4,v250k_400k,152.0,513.0,781.0,1153.0,612.0,321.0
4,2012,BACA,1236.0,2,v50k_150k,399.0,601.0,144.0,49.0,23.0,20.0





## Calculations
---

- **age, and gend**
  - `age_median`: (Existing)
  - `age_undr19_prop`: What percent of the population is under 19?
  - `gend_m_prop`: What percent of the population is male?
  - `age_undr19_gend_m_prop`: What percent of under-19 year old are male? (divide m_undr19 by undr19)
- **inc**
  - `inc_hh_med`: (Existing) Median household income
  - `inc_per_cap`: (Existing) Per capita income
- **hh**
  - `hh_size_avg`: (Existing) Average household size
- **race**
  - `race_{x}_prop`: What percent of the population is race x?
  - `race_prop_stdev`: What is the standard deviation of the race proportions? We need to calculate the proportions first, to normalize for the population size, that way, we can compare the standard deviations across groups
- **hsgrad**
  - `hsgrad_graduated_prop`: What percent of adults (age 25+) have a high school diploma or equivalent?
- **civ_lf**
  - `civ_lf_prop`: What percent of the population is in the civilian labor force?
  - `civ_lf_employed_prop`: What percent of the civilian labor force is employed?
- **ps**
  - `ps_total_prop`: What percent of people whose poverty status is known are below the poverty line?
  - `ps_undr18_total_prop`: What percent of under-18 people whose poverty status is known are below the poverty line?
  - `ps_undr18_prop`: What percent of people below the poverty line are under 18?
- **stud**
  - `stud_enrolled_prop`: Percent of people who could be enrolled in school that actually are enrolled
  - `stud_hs_prop`: What percent of gradeschool students (1-12) are high schoolers? (lower number indicates dropouts, which may associate with crime)
  - `stud_undergrad_prop`: What percent of enrolled students are undergraduates?
- **citz**
  - `citz_prop`: What percent of the population is a us citizen?
  - `citz_birth_prop`: What percent of us citizens were born in the us?
  - `citz_co_prop`: What percent of citizens were born in Colorado?
- **hu**
  - `hu_occ_prop`: Percent of homes which are occupied
  - `hu_blt_after1989`: Percent of homes which were built in the past 20 years
  - `hu_blt_nominal`: Convert hu_blt_mode_range to nominal, where the highest number corresponds to highest year range
- **hu_oo**
  - `hu_oo_prop`: Percent of occupied properties occupied by owner. The remaining percent is renter occupied
  - `hu_oo_lt_50_prop`: Percent of owner occupied properties worth less than $50,000
  - `hu_oo_750_plus_prop`: Percent of owner occupied properties worth $750,000 or more


In [15]:
df = main.copy()

df['age_over18_prop'] = df.age_over18 / df['pop']
df['age_undr19_prop'] = df.age_undr19 / df['pop']
df['gend_m_prop'] = df.gend_m / df['pop']
df['gend_f_prop'] = df.gend_f / df['pop']
df['age_undr19_gend_m_prop'] = df.gend_m_age_undr19 / df.age_undr19
df['age_undr19_gend_f_prop'] = df.gend_f_age_undr19 / df.age_undr19
df['age_over18_gend_m_prop'] = df.gend_m_age_over18 / df.age_over18
df['age_over18_gend_f_prop'] = df.gend_f_age_over18 / df.age_over18

df['gend_m_age_undr19_prop'] = df.gend_m_age_undr19 / df.gend_m
df['gend_m_age_over18_prop'] = df.gend_m_age_over18 / df.gend_m
df['gend_f_age_undr19_prop'] = df.gend_f_age_undr19 / df.gend_f
df['gend_f_age_over18_prop'] = df.gend_f_age_over18 / df.gend_f

race_base = GroupedDF(df, INDEX).race
race = df.copy()[INDEX]
for c in [c for c in race_base.columns if c not in INDEX]:
    race[f'{c}_prop'] = race_base[c] / df['pop']

race['race_prop_stdev'] = np.std(race.drop(columns=INDEX), axis=1)
df = df.merge(race, how='inner', on=INDEX)

df['hsgrad_graduated_prop'] = df.hsgrad_graduated / df.hsgrad_pool

df['civ_lf_prop'] = df.civ_lf / df['pop']
df['civ_lf_employed_prop'] = df.civ_lf_employed / df.civ_lf

df['ps_total_prop'] = df.ps_below / df.ps_known
df['ps_undr18_total_prop'] = df.ps_undr18_below / df.ps_undr18_known
df['ps_undr18_prop'] = df.ps_undr18_below / df.ps_below

df['stud_enrolled_prop'] = df.stud_enrolled / df.stud_enroll_pool
df['stud_hs_prop'] = df.stud_9_12 / (df.stud_1_4 + df.stud_5_8 + df.stud_9_12)
df['stud_undergrad_prop'] = df.stud_undergrad / df.stud_enrolled

df['citz_per_cap'] = df.citz / df['pop']
df['citz_birth_prop'] = df.citz_birth / df.citz
df['citz_co_prop'] = df.citz_co / df.citz

df['hu_per_cap'] = df.hu / df['pop']
df['hu_occ_prop'] = df.hu_occ / df.hu
df['hu_blt_2000_plus_prop'] = df.hu_blt_2000_plus / df.hu

df['hu_oo_prop'] = df.hu_oo / df.hu_occ

for hval in ['hu_oo_lt_50', 'hu_oo_50_150', 'hu_oo_150_250', 'hu_oo_250_400', 'hu_oo_400_750', 'hu_oo_750_plus']:
    df[f'{hval}_prop'] = df[hval] / df.hu_oo

for hyear in [
        'hu_blt_lt_1950', 'hu_blt_1950_1959', 'hu_blt_1960_1969',
        'hu_blt_1970_1979', 'hu_blt_1980_1989', 'hu_blt_1990_1999', 'hu_blt_2000_plus'
    ]:
    df[f'{hyear}_prop'] = df[hyear] / df.hu

prop, counts = separate_by(df, ['prop', 'per_cap', 'median', 'avg', 'freq', 'med_hm_val', 'med_yr_blt'], index=INDEX)

write_main(prop, 'county_stats_normalized')
write_main(counts, 'county_stats_counts')
write_main(df, 'county_stats')

gprop = GroupedDF(prop, INDEX, custom={'hu': INDEX + ['hu_per_cap', 'hu_occ_prop']})
gprop.display()

age: 


Unnamed: 0,year,county,age_over18_prop,age_undr19_prop,age_undr19_gend_m_prop,age_undr19_gend_f_prop,age_over18_gend_m_prop,age_over18_gend_f_prop,age_median
0,2012,ADAMS,0.70004,0.29996,2.341611,0.953744,2.003325,0.837159,32.4
1,2012,ALAMOSA,0.71865,0.28135,2.61154,0.928669,2.021878,0.754002,32.2
2,2012,ARAPAHOE,0.725375,0.274625,2.828618,0.951183,2.070926,0.738717,35.7



gend: 


Unnamed: 0,year,county,gend_m_prop,gend_f_prop,gend_m_age_undr19_prop,gend_m_age_over18_prop,gend_f_age_undr19_prop,gend_f_age_over18_prop
0,2012,ADAMS,2.104797,0.87213,0.333709,0.666291,0.32803,0.67197
1,2012,ALAMOSA,2.18778,0.803145,0.335846,0.664154,0.325322,0.674678
2,2012,ARAPAHOE,2.279007,0.797066,0.340854,0.659146,0.327725,0.672275



race: 


Unnamed: 0,year,county,race_hispanic_prop,race_white_prop,race_black_prop,race_asian_prop,race_other_prop,race_prop_stdev
0,2012,ADAMS,0.723562,1.019087,0.056009,0.066088,0.048257,0.409877
1,2012,ALAMOSA,0.918446,0.992842,0.014061,0.007542,0.080404,0.452841
2,2012,ARAPAHOE,0.35951,1.246859,0.190153,0.09594,0.070829,0.438949



inc: 


Unnamed: 0,year,county,inc_per_cap,inc_hh_median
0,2012,ADAMS,24357.0,56633.0
1,2012,ALAMOSA,19657.0,38045.0
2,2012,ARAPAHOE,32845.0,60400.0



hh: 


Unnamed: 0,year,county,hh_size_avg
0,2012,ADAMS,2.91
1,2012,ALAMOSA,2.49
2,2012,ARAPAHOE,2.55



citz: 


Unnamed: 0,year,county,citz_birth_prop,citz_co_prop,citz_per_cap
0,2012,ADAMS,0.950229,0.565176,1.710801
1,2012,ALAMOSA,0.983203,0.631001,1.933018
2,2012,ARAPAHOE,0.937754,0.429728,1.777281



hsgrad: 


Unnamed: 0,year,county,hsgrad_graduated_prop
0,2012,ADAMS,0.604913
1,2012,ALAMOSA,0.630942
2,2012,ARAPAHOE,0.525874



civ_lf: 


Unnamed: 0,year,county,civ_lf_prop,civ_lf_employed_prop
0,2012,ADAMS,1.019601,0.905485
1,2012,ALAMOSA,0.916656,0.899317
2,2012,ARAPAHOE,1.087141,0.9184



ps: 


Unnamed: 0,year,county,ps_total_prop,ps_undr18_total_prop,ps_undr18_prop
0,2012,ADAMS,0.141516,0.20324,0.407657
1,2012,ALAMOSA,0.218233,0.198585,0.237543
2,2012,ARAPAHOE,0.117654,0.159459,0.344372



stud: 


Unnamed: 0,year,county,stud_enrolled_prop,stud_hs_prop,stud_undergrad_prop
0,2012,ADAMS,0.279257,0.305236,0.164248
1,2012,ALAMOSA,0.359793,0.366708,0.426147
2,2012,ARAPAHOE,0.279887,0.340967,0.191012



hu: 


Unnamed: 0,year,county,hu_per_cap,hu_occ_prop
0,2012,ADAMS,0.704946,0.925198
1,2012,ALAMOSA,0.840087,0.890596
2,2012,ARAPAHOE,0.814089,0.939482



hu_blt: 


Unnamed: 0,year,county,hu_blt_2000_plus_prop,hu_blt_lt_1950_prop,hu_blt_1950_1959_prop,hu_blt_1960_1969_prop,hu_blt_1970_1979_prop,hu_blt_1980_1989_prop,hu_blt_1990_1999_prop,hu_blt_freq_yr_ord,hu_blt_freq_yr
0,2012,ADAMS,0.236957,0.037722,0.124776,0.120157,0.184906,0.12477,0.169059,7,b2000_2009
1,2012,ALAMOSA,0.098904,0.233719,0.089927,0.099513,0.213786,0.131163,0.131771,1,b1949_e
2,2012,ARAPAHOE,0.165498,0.030085,0.069361,0.093458,0.261392,0.235182,0.142715,4,b1970_1979



hu_oo: 


Unnamed: 0,year,county,hu_oo_prop,hu_oo_lt_50_prop,hu_oo_50_150_prop,hu_oo_150_250_prop,hu_oo_250_400_prop,hu_oo_400_750_prop,hu_oo_750_plus_prop,hu_oo_freq_val_ord,hu_oo_freq_val
0,2012,ADAMS,0.662818,0.085687,0.198166,0.475317,0.177598,0.054211,0.00902,3,v150k_250k
1,2012,ALAMOSA,0.632496,0.117504,0.431929,0.290924,0.107239,0.047812,0.004592,2,v50k_150k
2,2012,ARAPAHOE,0.639821,0.029387,0.154892,0.390722,0.266929,0.114133,0.043937,3,v150k_250k



