## Processing the Census Data By Subject Table at Texas Tract Level

Note: Although these tables can be analyzed entirely in Jupyter, it helps to download the actual
      spreadsheets, or at least the metadata file, and view them directly for visual clarity.

In [24]:
import os
import pandas as pd
pd.set_option('display.max_colwidth', None)

### First table is S0101 - Age and Sex

In [2]:
data_dir = 'data/subject_tables/unzipped_files/ACSST5Y2019.S0101_2021-01-12T120250'

In [3]:
S0101_data = pd.read_csv(os.path.join(data_dir, 'ACSST5Y2019.S0101_data_with_overlays_2021-01-08T174020.csv'), low_memory=False)
S0101_metadata = pd.read_csv(os.path.join(data_dir, 'ACSST5Y2019.S0101_metadata_2021-01-08T174020.csv'))
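
The data file carries a second header row of human-readable descriptions, which shows up as row 0 in the preview further down. A minimal sketch of one way to keep that row out of the values: read the descriptions on their own, then skip that line when loading the data. The in-memory CSV and the names `raw_csv`, `labels` and `values` are illustrative assumptions, not part of the Census download.

```python
import io
import pandas as pd

# Hypothetical miniature of the "data_with_overlays" layout: codes, descriptions, then values.
raw_csv = (
    "GEO_ID,NAME,S0101_C01_001E\n"
    "id,Geographic Area Name,Estimate!!Total!!Total population\n"
    '1400000US48001950100,"Census Tract 9501, Anderson County, Texas",4844\n'
)

# Read only the description row to build a code -> description lookup.
labels = pd.read_csv(io.StringIO(raw_csv), nrows=1).iloc[0].to_dict()

# Skip the description row (line index 1 of the file) so numeric columns parse as numbers.
values = pd.read_csv(io.StringIO(raw_csv), skiprows=[1], dtype={"GEO_ID": str})

print(labels["S0101_C01_001E"])
print(values.dtypes)
```

Loading the values this way avoids having to drop row 0 and convert strings to numbers later, at the cost of reading the file twice.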

In [4]:
print(f'Shape of data table proper is {S0101_data.shape} \n'
    + f'Shape of metadata table is {S0101_metadata.shape}')

Shape of data table proper is (5266, 458) 
Shape of metadata table is (457, 2)


In [5]:
S0101_data.head(2)

Unnamed: 0,GEO_ID,NAME,S0101_C01_001E,S0101_C01_001M,S0101_C01_002E,S0101_C01_002M,S0101_C01_003E,S0101_C01_003M,S0101_C01_004E,S0101_C01_004M,...,S0101_C06_034E,S0101_C06_034M,S0101_C06_035E,S0101_C06_035M,S0101_C06_036E,S0101_C06_036M,S0101_C06_037E,S0101_C06_037M,S0101_C06_038E,S0101_C06_038M
0,id,Geographic Area Name,Estimate!!Total!!Total population,Margin of Error!!Total!!Total population,Estimate!!Total!!Total population!!AGE!!Under ...,Margin of Error!!Total!!Total population!!AGE!...,Estimate!!Total!!Total population!!AGE!!5 to 9...,Margin of Error!!Total!!Total population!!AGE!...,Estimate!!Total!!Total population!!AGE!!10 to ...,Margin of Error!!Total!!Total population!!AGE!...,...,Estimate!!Percent Female!!Total population!!SU...,Margin of Error!!Percent Female!!Total populat...,Estimate!!Percent Female!!Total population!!SU...,Margin of Error!!Percent Female!!Total populat...,Estimate!!Percent Female!!Total population!!SU...,Margin of Error!!Percent Female!!Total populat...,Estimate!!Percent Female!!Total population!!PE...,Margin of Error!!Percent Female!!Total populat...,Estimate!!Percent Female!!Total population!!PE...,Margin of Error!!Percent Female!!Total populat...
1,1400000US48001950100,"Census Tract 9501, Anderson County, Texas",4844,524,349,131,269,119,372,137,...,(X),(X),(X),(X),(X),(X),(X),(X),(X),(X)


In [6]:
S0101_metadata.iloc[:2]

Unnamed: 0,GEO_ID,id
0,NAME,Geographic Area Name
1,S0101_C01_001E,Estimate!!Total!!Total population


In [7]:
# Break the GEO_ID column into its FIPS components and store each one in a new column.
# The .str accessor applies every step row by row: split on 'US', keep the part after it, then slice the digits.
# (Plain [1][1] indexing on the split result would select row 1 only and copy its values into every row.)
fips = S0101_data.GEO_ID.str.split('US').str[1]
S0101_data["STATE"] = fips.str[:2]
S0101_data["COUNTYFP"] = fips.str[2:5]
S0101_data["TRACTCE"] = fips.str[5:]

In [8]:
S0101_data[['STATE', 'COUNTYFP', 'TRACTCE']].head(2)

Unnamed: 0,STATE,COUNTYFP,TRACTCE
0,NaN,NaN,NaN
1,48,001,950100


The rows of the metadata table act as a dictionary for the columns of the data table, but a few offsets are needed
before the indices match (i.e. row 1 in the metadata table corresponds to column 1 in the data table).

If we drop the GEO_ID and NAME columns from the data table and the first row (NAME) of the metadata table, the two line up.
Before dropping GEO_ID, we break it into its FIPS components so this dataset is easier to match against others.

A tract-level FIPS code is 11 digits long: 2 state digits, 3 county digits, and 6 tract digits.

In the GEO_ID column, the FIPS code is everything after the characters US.
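
As an illustration of the 2 + 3 + 6 digit layout, here is a small sketch that pulls the three components out with a regular expression. It is an alternative to slicing, not the method used in this notebook; the frame `geo` is a toy example, and its second GEO_ID is made up.

```python
import pandas as pd

# Toy frame: the first GEO_ID appears in the preview above, the second is invented.
geo = pd.DataFrame({"GEO_ID": ["1400000US48001950100", "1400000US48453001100"]})

# Everything after "US" is the 11-digit code: 2 state + 3 county + 6 tract digits.
pattern = r"US(?P<STATE>\d{2})(?P<COUNTYFP>\d{3})(?P<TRACTCE>\d{6})$"
parts = geo["GEO_ID"].str.extract(pattern)

# Named groups become column names; values stay strings, so leading zeros survive.
geo = geo.join(parts)
print(geo)
```

Keeping the components as strings matters here: converting COUNTYFP to an integer would turn `001` into `1` and break joins against other FIPS-keyed tables.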

In [9]:
S0101_data.drop(columns=['GEO_ID', 'NAME'], inplace=True)
S0101_metadata.drop(index=0, inplace=True)
S0101_metadata.reset_index(drop=True, inplace=True)

# Subtract 3 from S0101_data.shape[1] to account for the 3 columns we created (STATE, COUNTYFP, TRACTCE)
assert S0101_data.shape[1] - 3 == S0101_metadata.shape[0], 'mismatch in metadata and data correspondence'
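
A count comparison passes even if the two tables list the codes in a different order. A stricter check, sketched here on hypothetical miniature tables (`data`, `metadata` and `derived` are stand-ins, not the real frames), compares the codes themselves:

```python
import pandas as pd

# Hypothetical miniature tables standing in for the data and metadata files.
data = pd.DataFrame({"S0101_C01_001E": ["4844"], "S0101_C01_001M": ["524"], "STATE": ["48"]})
metadata = pd.DataFrame({
    "GEO_ID": ["S0101_C01_001E", "S0101_C01_001M"],
    "id": ["Estimate!!Total!!Total population", "Margin of Error!!Total!!Total population"],
})

derived = ["STATE", "COUNTYFP", "TRACTCE"]  # columns we created ourselves
data_codes = [c for c in data.columns if c not in derived]

# Compare the codes one by one, in order, instead of only comparing the counts.
aligned = data_codes == metadata["GEO_ID"].tolist()
assert aligned, "mismatch in metadata and data correspondence"
```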

In [25]:
S0101_metadata.head(15)

Unnamed: 0,GEO_ID,id
0,S0101_C01_001E,Estimate!!Total!!Total population
1,S0101_C01_001M,Margin of Error!!Total!!Total population
2,S0101_C01_002E,Estimate!!Total!!Total population!!AGE!!Under 5 years
3,S0101_C01_002M,Margin of Error!!Total!!Total population!!AGE!!Under 5 years
4,S0101_C01_003E,Estimate!!Total!!Total population!!AGE!!5 to 9 years
5,S0101_C01_003M,Margin of Error!!Total!!Total population!!AGE!!5 to 9 years
6,S0101_C01_004E,Estimate!!Total!!Total population!!AGE!!10 to 14 years
7,S0101_C01_004M,Margin of Error!!Total!!Total population!!AGE!!10 to 14 years
8,S0101_C01_005E,Estimate!!Total!!Total population!!AGE!!15 to 19 years
9,S0101_C01_005M,Margin of Error!!Total!!Total population!!AGE!!15 to 19 years


In [16]:
int_locations = [3, *range(15, 30, 2), 43, 47, 61, 155, 307]  # Row numbers as they appear in the metadata Excel sheet
int_locations = [x - 3 for x in int_locations]  # Offset the indices for the header row, the dropped NAME row, and pandas' 0-based indexing
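
Spreadsheet row numbers are easy to get wrong by one. A possible alternative, shown here as a sketch on a hypothetical excerpt of the metadata table, is to look the codes up by their descriptions; `wanted_labels` and `selected` are illustrative names.

```python
import pandas as pd

# Hypothetical excerpt of the metadata table (codes in GEO_ID, descriptions in id).
metadata = pd.DataFrame({
    "GEO_ID": ["S0101_C01_001E", "S0101_C01_001M", "S0101_C01_007E", "S0101_C03_001E"],
    "id": [
        "Estimate!!Total!!Total population",
        "Margin of Error!!Total!!Total population",
        "Estimate!!Total!!Total population!!AGE!!25 to 29 years",
        "Estimate!!Male!!Total population",
    ],
})

wanted_labels = [
    "Estimate!!Total!!Total population",
    "Estimate!!Total!!Total population!!AGE!!25 to 29 years",
    "Estimate!!Male!!Total population",
]

# Select codes by description; the boolean mask keeps the metadata row order.
selected = metadata.loc[metadata["id"].isin(wanted_labels), "GEO_ID"].tolist()
print(selected)
```

The trade-off is verbosity: the descriptions are long, but the selection no longer depends on how many rows were dropped beforehand.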

In [26]:
S0101_metadata.iloc[int_locations]

Unnamed: 0,GEO_ID,id
0,S0101_C01_001E,Estimate!!Total!!Total population
12,S0101_C01_007E,Estimate!!Total!!Total population!!AGE!!25 to 29 years
14,S0101_C01_008E,Estimate!!Total!!Total population!!AGE!!30 to 34 years
16,S0101_C01_009E,Estimate!!Total!!Total population!!AGE!!35 to 39 years
18,S0101_C01_010E,Estimate!!Total!!Total population!!AGE!!40 to 44 years
20,S0101_C01_011E,Estimate!!Total!!Total population!!AGE!!45 to 49 years
22,S0101_C01_012E,Estimate!!Total!!Total population!!AGE!!50 to 54 years
24,S0101_C01_013E,Estimate!!Total!!Total population!!AGE!!55 to 59 years
26,S0101_C01_014E,Estimate!!Total!!Total population!!AGE!!60 to 64 years
40,S0101_C01_021E,Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!15 to 17 years


In [27]:
wanted_columns = S0101_metadata.GEO_ID.iloc[int_locations].to_list()

In [28]:
wanted_columns

['S0101_C01_001E',
 'S0101_C01_007E',
 'S0101_C01_008E',
 'S0101_C01_009E',
 'S0101_C01_010E',
 'S0101_C01_011E',
 'S0101_C01_012E',
 'S0101_C01_013E',
 'S0101_C01_014E',
 'S0101_C01_021E',
 'S0101_C01_023E',
 'S0101_C01_030E',
 'S0101_C03_001E',
 'S0101_C05_001E']

In [30]:
S0101_data = S0101_data[wanted_columns]

In [31]:
S0101_data.head(2)

Unnamed: 0,S0101_C01_001E,S0101_C01_007E,S0101_C01_008E,S0101_C01_009E,S0101_C01_010E,S0101_C01_011E,S0101_C01_012E,S0101_C01_013E,S0101_C01_014E,S0101_C01_021E,S0101_C01_023E,S0101_C01_030E,S0101_C03_001E,S0101_C05_001E
0,Estimate!!Total!!Total population,Estimate!!Total!!Total population!!AGE!!25 to 29 years,Estimate!!Total!!Total population!!AGE!!30 to 34 years,Estimate!!Total!!Total population!!AGE!!35 to 39 years,Estimate!!Total!!Total population!!AGE!!40 to 44 years,Estimate!!Total!!Total population!!AGE!!45 to 49 years,Estimate!!Total!!Total population!!AGE!!50 to 54 years,Estimate!!Total!!Total population!!AGE!!55 to 59 years,Estimate!!Total!!Total population!!AGE!!60 to 64 years,Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!15 to 17 years,Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!18 to 24 years,Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!65 years and over,Estimate!!Male!!Total population,Estimate!!Female!!Total population
1,4844,252,197,335,163,312,447,318,318,250,205,1057,2486,2358


In [32]:
column_dictionary = dict(zip(S0101_data.columns, S0101_data.iloc[0]))

In [33]:
column_dictionary

{'S0101_C01_001E': 'Estimate!!Total!!Total population',
 'S0101_C01_007E': 'Estimate!!Total!!Total population!!AGE!!25 to 29 years',
 'S0101_C01_008E': 'Estimate!!Total!!Total population!!AGE!!30 to 34 years',
 'S0101_C01_009E': 'Estimate!!Total!!Total population!!AGE!!35 to 39 years',
 'S0101_C01_010E': 'Estimate!!Total!!Total population!!AGE!!40 to 44 years',
 'S0101_C01_011E': 'Estimate!!Total!!Total population!!AGE!!45 to 49 years',
 'S0101_C01_012E': 'Estimate!!Total!!Total population!!AGE!!50 to 54 years',
 'S0101_C01_013E': 'Estimate!!Total!!Total population!!AGE!!55 to 59 years',
 'S0101_C01_014E': 'Estimate!!Total!!Total population!!AGE!!60 to 64 years',
 'S0101_C01_021E': 'Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!15 to 17 years',
 'S0101_C01_023E': 'Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!18 to 24 years',
 'S0101_C01_030E': 'Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!65 years and over',
 'S0101_C03_001E': 'Estimate!!Male!!Total population',
 'S0101_C05_001E': 'Estimate!!Female!!Total population'}
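
The descriptions use `!!` as a level separator, so they can be split into parts and shortened. The helper below is a hypothetical sketch (`short_label` is not part of this notebook) that handles only the patterns seen in the dictionary above:

```python
def short_label(description: str) -> str:
    """Turn an ACS description into a compact name (hypothetical helper)."""
    parts = description.split("!!")
    # parts[0] is "Estimate" or "Margin of Error"; parts[1] is the group (Total, Male, Female).
    group, detail = parts[1], parts[-1]
    if detail == "Total population":
        return "total_pop" if group == "Total" else group.lower()
    # Otherwise the last part is an age range such as "25 to 29 years".
    return (
        detail.replace(" years", "")
        .replace(" to ", "_")
        .replace(" and over", "+")
        .replace(" ", "_")
    )


print(short_label("Estimate!!Total!!Total population!!AGE!!25 to 29 years"))
print(short_label("Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!65 years and over"))
print(short_label("Estimate!!Male!!Total population"))
```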

In [34]:
# Row 0 holds the column descriptions, which are now saved in column_dictionary, so drop it.
S0101_data.drop(index=0, inplace=True)

In [36]:
S0101_data.reset_index(drop=True, inplace=True)

In [37]:
S0101_data.head()

Unnamed: 0,S0101_C01_001E,S0101_C01_007E,S0101_C01_008E,S0101_C01_009E,S0101_C01_010E,S0101_C01_011E,S0101_C01_012E,S0101_C01_013E,S0101_C01_014E,S0101_C01_021E,S0101_C01_023E,S0101_C01_030E,S0101_C03_001E,S0101_C05_001E
0,4844,252,197,335,163,312,447,318,318,250,205,1057,2486,2358
1,4838,843,907,753,444,419,314,223,73,0,634,84,4658,180
2,7511,586,1339,1333,1171,980,790,485,315,5,268,214,7425,86
3,4465,483,199,268,413,204,224,253,191,190,383,654,2273,2192
4,5148,552,355,484,204,182,334,279,373,344,448,730,2530,2618


In [38]:
S0101_data.shape

(5265, 14)
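
Because the description row shared these columns, the remaining values are strings rather than numbers, and they need converting before any arithmetic. A minimal sketch on a toy frame, assuming non-numeric placeholders such as the `(X)` seen in the first preview (the `-` entry is an assumed example):

```python
import pandas as pd

# Toy frame mimicking string-typed estimate columns with placeholder entries.
df = pd.DataFrame({
    "S0101_C01_001E": ["4844", "4838", "-"],
    "S0101_C01_007E": ["252", "(X)", "586"],
})

# errors="coerce" turns anything that is not a number into NaN instead of raising.
numeric = df.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric.isna().sum())
```

Counting the NaN values afterwards shows how many entries were placeholders, which is worth checking before summing columns.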

Reorder the columns so the age groups run in ascending order, then collapse adjacent bins where needed to match the bins we want.

In [46]:
columns = list(S0101_data.columns)

In [48]:
columns = ['S0101_C01_001E',
    'S0101_C01_021E',
    'S0101_C01_023E',
    'S0101_C01_007E',
    'S0101_C01_008E',
    'S0101_C01_009E',
    'S0101_C01_010E',
    'S0101_C01_011E',
    'S0101_C01_012E',
    'S0101_C01_013E',
    'S0101_C01_014E', 
    'S0101_C01_030E',
    'S0101_C03_001E',
    'S0101_C05_001E']

In [49]:
S0101_data = S0101_data[columns]

In [50]:
S0101_data.head(2)

Unnamed: 0,S0101_C01_001E,S0101_C01_021E,S0101_C01_023E,S0101_C01_007E,S0101_C01_008E,S0101_C01_009E,S0101_C01_010E,S0101_C01_011E,S0101_C01_012E,S0101_C01_013E,S0101_C01_014E,S0101_C01_030E,S0101_C03_001E,S0101_C05_001E
0,4844,250,205,252,197,335,163,312,447,318,318,1057,2486,2358
1,4838,0,634,843,907,753,444,419,314,223,73,84,4658,180


In [None]:
S0101_data.iloc

In [43]:
new_bins = ['total_pop', '15_17', '18_24', '25_29', '30_34', '35_49', '50_64', '65+', 'male', 'female']
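
One way to collapse the five-year columns into the wider bins is a mapping from each new bin name to the source columns it absorbs. The sketch below uses the first two rows of the table above, already as integers; which columns feed `35_49` and `50_64` is an assumption inferred from the bin names, and `bin_map` is an illustrative name.

```python
import pandas as pd

# First two rows of the table above, restricted to the columns being merged.
df = pd.DataFrame({
    "S0101_C01_009E": [335, 753],  # 35 to 39 years
    "S0101_C01_010E": [163, 444],  # 40 to 44 years
    "S0101_C01_011E": [312, 419],  # 45 to 49 years
    "S0101_C01_012E": [447, 314],  # 50 to 54 years
    "S0101_C01_013E": [318, 223],  # 55 to 59 years
    "S0101_C01_014E": [318, 73],   # 60 to 64 years
})

# Assumed grouping: each new bin is the row-wise sum of its source columns.
bin_map = {
    "35_49": ["S0101_C01_009E", "S0101_C01_010E", "S0101_C01_011E"],
    "50_64": ["S0101_C01_012E", "S0101_C01_013E", "S0101_C01_014E"],
}

binned = pd.DataFrame({name: df[cols].sum(axis=1) for name, cols in bin_map.items()})
print(binned)
```

Selecting by column name rather than by position keeps the grouping valid even if the columns are reordered again later.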

In [51]:
for i in S0101_data.columns:
    print(column_dictionary[i])

Estimate!!Total!!Total population
Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!15 to 17 years
Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!18 to 24 years
Estimate!!Total!!Total population!!AGE!!25 to 29 years
Estimate!!Total!!Total population!!AGE!!30 to 34 years
Estimate!!Total!!Total population!!AGE!!35 to 39 years
Estimate!!Total!!Total population!!AGE!!40 to 44 years
Estimate!!Total!!Total population!!AGE!!45 to 49 years
Estimate!!Total!!Total population!!AGE!!50 to 54 years
Estimate!!Total!!Total population!!AGE!!55 to 59 years
Estimate!!Total!!Total population!!AGE!!60 to 64 years
Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!65 years and over
Estimate!!Male!!Total population
Estimate!!Female!!Total population


In [74]:
S0101_data_copy = S0101_data.copy()

In [75]:
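def collapse_columns_into_one(df, cols: list[int], target: int, drops=None):
    # cols, target and drops are integer column positions, so translate them to column labels first.
    col_names = list(df.columns[cols])
    target_name = df.columns[target]
    # The values were read in as strings (the description row shared these columns), so cast before summing.
    df[target_name] = df[col_names].apply(pd.to_numeric, errors='coerce').sum(axis=1)
    if drops is None:
        # Drop every summed column except the target, without mutating the caller's list.
        drop_names = [name for name in col_names if name != target_name]
    else:
        drop_names = list(df.columns[drops])
    df.drop(columns=drop_names, inplace=True)
    return df

In [76]:
temp = collapse_columns_into_one(S0101_data_copy, [5,6,7], 5)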

In [63]:
hehe = [5,6,7]

In [67]:
hehe.pop(hehe.index(5))

5

In [68]:
hehe

[6, 7]
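
The scratch cells above show that `list.pop` removes the element from the list itself. That matters whenever a helper pops from a list passed in by its caller, because the caller's list changes too. A short sketch contrasting the two behaviours, with hypothetical function names:

```python
def others_mutating(cols, target):
    # pop edits the list in place, so the caller's list loses the target as well.
    cols.pop(cols.index(target))
    return cols


def others_copying(cols, target):
    # A comprehension builds a new list and leaves the caller's list untouched.
    return [c for c in cols if c != target]


a = [5, 6, 7]
others_mutating(a, 5)

b = [5, 6, 7]
rest = others_copying(b, 5)

print(a)
print(b, rest)
```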