#### Criteria for classifying a tract as a low income community:
- The **poverty rate is at least 20 percent**, OR
- The **median family income does not exceed 80 percent of statewide median family income** or, if in a metropolitan area, **the greater of 80 percent statewide median family income or 80 percent of metropolitan area median family income**
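
The two criteria can be read as a single yes/no test per tract. The following is a minimal plain-Python sketch of that test, not the notebook's own implementation; the function name `is_low_income` and its parameter names are hypothetical, and all incomes are assumed to be in the same inflation-adjusted dollars.

```python
from typing import Optional


def is_low_income(poverty_rate: float,
                  tract_mfi: float,
                  state_mfi: float,
                  metro_mfi: Optional[float] = None) -> bool:
    """Return True if a tract meets either low income criterion.

    poverty_rate is expressed in percent (for example 23.5).
    metro_mfi is None for tracts outside a metropolitan area.
    """
    # Criterion 1: poverty rate of at least 20 percent.
    if poverty_rate >= 20:
        return True
    # Criterion 2: tract median family income (MFI) at or below 80 percent
    # of the benchmark. Inside a metropolitan area the benchmark is the
    # greater of the statewide and the metropolitan MFI.
    benchmark = state_mfi
    if metro_mfi is not None:
        benchmark = max(state_mfi, metro_mfi)
    return tract_mfi <= 0.8 * benchmark
```

Because 80 percent of the larger of two numbers equals the larger of the two 80 percent thresholds, taking `max` before multiplying matches the wording of the second criterion.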


#### Steps done in this notebook to prepare the data:
- Download the datasets `Table B17020: Poverty Status in the Past 12 Months - Tracts` and `Table B19113: Median Family Income in the Past 12 Months (in 2020 inflation-adjusted dollars) - Tracts, Metropolitan area, State`.
- Filter the poverty dataframe down to the needed columns and add a calculated column `poverty_percent`, used later to select tracts at or above 20%. Save it as `poverty_clean.csv`.
- For the median income dataframes (tracts, state), remove the unnecessary columns and save them as `income_clean.csv` and `income_state_clean.csv`.
- Merge the two dataframes to add the `st_med_income` column to the tract-level data and save the result as `income_tract_st_merged.csv`.
- For the MSA median income dataframe, clean it similarly and keep only the necessary columns.
- The MSA dataframe has no attribute that joins it to the tract data, so a spatial join is needed: download the MSA shapefile, merge the `msa_med_income` column onto it, and save the result as `msa_income_shp.shp`.
- The MSAs now have geometries for a spatial join with the tracts, but the poverty and income datasets do not.
- Get the tract-level shapefiles for all the tracts in the nation:
    - The tracts cannot be downloaded as a single file, so a web scraper such as BS4 is used to fetch the per-state files.
    - Once all the tracts are downloaded and merged into a single GeoDataFrame, save it to `merged_tracts/merged_tracts.shp`.
- Merge all the datasets into a single shapefile.
    - First merge the tract-level median income dataset with the merged tracts file, then merge the result with the poverty dataset, so that every record carries a geometry.
    - Then perform a spatial join between the MSA shapefile and the merged GeoDataFrame.
- Dissolve the geometries by `tractId`, since the spatial join produces duplicates when a tract overlaps multiple MSA geometries.
- Apply the conditions for classifying a tract as a low income community and save the final data as `low_income_tracts/low_income_tracts.shp`.

In [122]:
import dask.dataframe as dd
import zipfile
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import requests
from bs4 import BeautifulSoup
import urllib.parse
pd.set_option('display.max_columns', None)
current_dir = os.getcwd()
wd = os.path.join(current_dir,'../data/Low Income Communities/')
wd = wd.replace('\\', '/')

#### 1. Poverty Dataset

In [88]:
poverty_df = pd.read_csv(wd + "poverty/R13409161_SL140.csv")
poverty_df.head()

Unnamed: 0,Geo_FILEID,Geo_STUSAB,Geo_SUMLEV,Geo_GEOCOMP,Geo_LOGRECNO,Geo_US,Geo_REGION,Geo_DIVISION,Geo_STATECE,Geo_STATE,...,ACS20_5yr_B17020008s,ACS20_5yr_B17020009s,ACS20_5yr_B17020010s,ACS20_5yr_B17020011s,ACS20_5yr_B17020012s,ACS20_5yr_B17020013s,ACS20_5yr_B17020014s,ACS20_5yr_B17020015s,ACS20_5yr_B17020016s,ACS20_5yr_B17020017s
0,ACSSF,al,140,0,1788,,,,,1,...,7.272727,7.272727,227.2727,41.21212,86.66666,19.39394,115.7576,46.66667,25.45455,44.84848
1,ACSSF,al,140,0,1789,,,,,1,...,18.78788,7.272727,157.5758,24.84848,16.36364,29.69697,112.7273,55.75758,21.81818,18.18182
2,ACSSF,al,140,0,1790,,,,,1,...,26.06061,9.69697,331.5151,35.15152,89.69697,65.45454,207.2727,86.66666,35.75758,18.78788
3,ACSSF,al,140,0,1791,,,,,1,...,6.060606,7.272727,301.2121,38.78788,40.0,48.48485,196.9697,92.72727,93.33334,93.93939
4,ACSSF,al,140,0,1792,,,,,1,...,7.272727,7.272727,386.0606,67.87878,65.45454,86.06061,309.0909,85.45454,83.0303,32.12121


In [89]:
nulls_df = pd.DataFrame(poverty_df.isnull().sum().sort_values(ascending=False)/ poverty_df.shape[0] * 100, columns=['Nulls%'])
nulls_df

Unnamed: 0,Nulls%
Geo_BTBG,100.0
Geo_SDUNI,100.0
Geo_NECTA,100.0
Geo_CNECTA,100.0
Geo_NECTADIV,100.0
...,...
ACS20_5yr_B17020005,0.0
ACS20_5yr_B17020006,0.0
ACS20_5yr_B17020007,0.0
ACS20_5yr_B17020008,0.0
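
The table above ranks columns by their share of missing values, and the columns at 100% are the ones dropped next. As an illustration of the same idea without pandas, the sketch below computes the percentage of empty cells per column from CSV text using only the standard library; the helper name `null_percentages` and the three-column sample are made up for the example.

```python
import csv
import io


def null_percentages(csv_text: str) -> dict:
    """Map each column name to the percentage of rows where it is empty."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    return {
        col: 100.0 * sum(1 for r in rows if r[col] == "") / len(rows)
        for col in rows[0]
    }


# Tiny made-up sample: Geo_US is empty in every row.
sample = "Geo_STUSAB,Geo_US,B17020001\nal,,1941\nal,,1511\n"
pct = null_percentages(sample)
fully_empty = [col for col, p in pct.items() if p == 100.0]
```

In pandas, `DataFrame.dropna(axis=1, how='all')` removes such fully empty columns in a single call, which could replace the manual `Nulls%` filter if the intermediate table is not needed.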


In [90]:
cols_removal = nulls_df[nulls_df['Nulls%']==100].index.to_list()
# cols_removal

In [91]:
poverty_df[cols_removal].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85395 entries, 0 to 85394
Data columns (total 43 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Geo_BTBG      0 non-null      float64
 1   Geo_SDUNI     0 non-null      float64
 2   Geo_NECTA     0 non-null      float64
 3   Geo_CNECTA    0 non-null      float64
 4   Geo_NECTADIV  0 non-null      float64
 5   Geo_UA        0 non-null      float64
 6   Geo_CDCURR    0 non-null      float64
 7   Geo_SLDU      0 non-null      float64
 8   Geo_ZCTA5     0 non-null      float64
 9   Geo_SUBMCD    0 non-null      float64
 10  Geo_SDELM     0 non-null      float64
 11  Geo_SDSEC     0 non-null      float64
 12  Geo_UR        0 non-null      float64
 13  Geo_MACC      0 non-null      float64
 14  Geo_PCI       0 non-null      float64
 15  Geo_PUMA5     0 non-null      float64
 16  Geo_BTTR      0 non-null      float64
 17  Geo_PLACESE   0 non-null      float64
 18  Geo_UACP      0 non-null  

In [92]:
poverty_df.drop(cols_removal, axis=1, inplace=True)
poverty_df.head()

Unnamed: 0,Geo_FILEID,Geo_STUSAB,Geo_SUMLEV,Geo_GEOCOMP,Geo_LOGRECNO,Geo_STATE,Geo_COUNTY,Geo_TRACT,Geo_GEOID,Geo_NAME,...,ACS20_5yr_B17020008s,ACS20_5yr_B17020009s,ACS20_5yr_B17020010s,ACS20_5yr_B17020011s,ACS20_5yr_B17020012s,ACS20_5yr_B17020013s,ACS20_5yr_B17020014s,ACS20_5yr_B17020015s,ACS20_5yr_B17020016s,ACS20_5yr_B17020017s
0,ACSSF,al,140,0,1788,1,1,20100,14000US01001020100,Census Tract 201,...,7.272727,7.272727,227.2727,41.21212,86.66666,19.39394,115.7576,46.66667,25.45455,44.84848
1,ACSSF,al,140,0,1789,1,1,20200,14000US01001020200,Census Tract 202,...,18.78788,7.272727,157.5758,24.84848,16.36364,29.69697,112.7273,55.75758,21.81818,18.18182
2,ACSSF,al,140,0,1790,1,1,20300,14000US01001020300,Census Tract 203,...,26.06061,9.69697,331.5151,35.15152,89.69697,65.45454,207.2727,86.66666,35.75758,18.78788
3,ACSSF,al,140,0,1791,1,1,20400,14000US01001020400,Census Tract 204,...,6.060606,7.272727,301.2121,38.78788,40.0,48.48485,196.9697,92.72727,93.33334,93.93939
4,ACSSF,al,140,0,1792,1,1,20501,14000US01001020501,Census Tract 205.01,...,7.272727,7.272727,386.0606,67.87878,65.45454,86.06061,309.0909,85.45454,83.0303,32.12121


#### Data Dictionary for the `poverty_df` table:
Variables 
-      FILEID:         File identification
-      STUSAB:         State Postal Abbreviation
-      SUMLEV:         Summary Level
-      GEOCOMP:        Geographic Component
-      LOGRECNO:       Logical Record Number
-      STATE:          State (FIPS Code)
-      COUNTY:         County of current residence
-      TRACT:          Census Tract
-      GEOID:          Geographic Identifier
-      NAME:           Area Name
-      QName:          Qualifying Name
-      FIPS:           FIPS
-      AREALAND:       Area (Land)
-      AREAWATR:       Area (Water)
-      B17020001:      Total
-      B17020002:      Total: Income In The Past 12 Months Below Poverty Level
-      B17020001s:     Std. Error: Total
-      B17020002s:     Std. Error: Total: Income In The Past 12 Months Below Poverty Level
-      B17020010s:     Std. Error: Total: Income In The Past 12 Months At or Above Poverty Level
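
The `GEOID` and `FIPS` fields encode the same tract in two forms: `Geo_GEOID` looks like `14000US01001020100`, while `Geo_FIPS` is read as the integer `1001020100`, which has lost the leading zero of the state code. The sketch below shows how the identifier decomposes, assuming the usual tract layout of 2 state digits, 3 county digits and 6 tract digits after the `US` marker; the helper names are hypothetical.

```python
def split_tract_geoid(geo_id: str) -> dict:
    """Split an ACS tract identifier such as '14000US01001020100'."""
    prefix, _, fips = geo_id.partition("US")
    if len(fips) != 11 or not fips.isdigit():
        raise ValueError(f"not a tract GEOID: {geo_id!r}")
    return {
        "summary_level": prefix[:3],  # '140' is the census tract level
        "state": fips[:2],
        "county": fips[2:5],
        "tract": fips[5:],
        "fips": fips,
    }


def pad_tract_fips(value: int) -> str:
    """Restore leading zeros lost when an 11-digit tract FIPS is read as an integer."""
    return str(value).zfill(11)
```

For the first Alabama row, `split_tract_geoid('14000US01001020100')` yields state `01`, county `001` and tract `020100`, and `pad_tract_fips(1001020100)` returns `01001020100`. Keeping the key as a zero-padded string is one way to avoid dtype mismatches when joining against shapefile `GEOID` values, which are stored as text.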

In [93]:
# Keep only the columns needed from the data dictionary above and give them short names
cols_to_keep = ['Geo_STUSAB','Geo_GEOID','Geo_QName','Geo_FIPS','Geo_AREALAND','Geo_AREAWATR','ACS20_5yr_B17020001','ACS20_5yr_B17020002',]
poverty_df = poverty_df[cols_to_keep].copy()  # explicit copy so the edits below do not raise SettingWithCopyWarning
poverty_df.columns = ['state','geo_id','tract_name','tractId','area_land','area_water','total_pop','poverty_pop']
poverty_df['state'] = poverty_df['state'].str.upper()

In [94]:
poverty_df['poverty_percent'] = (poverty_df['poverty_pop']/poverty_df['total_pop'])*100
# poverty_df = poverty_df[poverty_df['poverty_percent'] >= 20]
poverty_df.to_csv(wd+ 'poverty/poverty_clean.csv', index=False)
poverty_df.head()

Unnamed: 0,state,geo_id,tract_name,tractId,area_land,area_water,total_pop,poverty_pop,poverty_percent
0,AL,14000US01001020100,"Census Tract 201, Autauga County, Alabama",1001020100,9825304,28435,1941,265,13.652756
1,AL,14000US01001020200,"Census Tract 202, Autauga County, Alabama",1001020200,3320818,5669,1511,257,17.008604
2,AL,14000US01001020300,"Census Tract 203, Autauga County, Alabama",1001020300,5349271,9054,3694,533,14.428803
3,AL,14000US01001020400,"Census Tract 204, Autauga County, Alabama",1001020400,6384282,8408,3539,281,7.940096
4,AL,14000US01001020501,"Census Tract 205.01, Autauga County, Alabama",1001020501,6203654,0,4306,802,18.625174
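
`poverty_percent` is the share of the tract's poverty-universe population below the poverty level. A tract with `total_pop` equal to 0 has no defined rate (the pandas division yields `NaN` for 0/0), so it cannot satisfy the 20% criterion. The sketch below spells that out in plain Python; the function names are hypothetical and the input numbers are taken from the rows shown in this notebook.

```python
from typing import Optional


def poverty_percent(poverty_pop: int, total_pop: int) -> Optional[float]:
    """Percentage of people below the poverty level, or None if undefined."""
    if total_pop == 0:
        return None
    return 100.0 * poverty_pop / total_pop


def meets_poverty_criterion(poverty_pop: int, total_pop: int,
                            threshold: float = 20.0) -> bool:
    """True when the poverty rate is defined and at least `threshold` percent."""
    pct = poverty_percent(poverty_pop, total_pop)
    return pct is not None and pct >= threshold
```

With the values above, Census Tract 201 (265 of 1941) stays below the threshold, while a tract with 831 of 3523 people in poverty meets it.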


In [95]:
poverty_df = pd.read_csv(wd+ 'poverty/poverty_clean.csv')
poverty_df.shape

(85395, 9)

#### 2.1 Income Dataset (tract level)

In [11]:
li_df = pd.read_csv(wd + "income/R13409069_SL140.csv")
li_df.shape

(85395, 59)

In [12]:
li_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85395 entries, 0 to 85394
Data columns (total 59 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Geo_FILEID            85395 non-null  object 
 1   Geo_STUSAB            85395 non-null  object 
 2   Geo_SUMLEV            85395 non-null  int64  
 3   Geo_GEOCOMP           85395 non-null  int64  
 4   Geo_LOGRECNO          85395 non-null  int64  
 5   Geo_US                0 non-null      float64
 6   Geo_REGION            0 non-null      float64
 7   Geo_DIVISION          0 non-null      float64
 8   Geo_STATECE           0 non-null      float64
 9   Geo_STATE             85395 non-null  int64  
 10  Geo_COUNTY            85395 non-null  int64  
 11  Geo_COUSUB            0 non-null      float64
 12  Geo_PLACE             0 non-null      float64
 13  Geo_TRACT             85395 non-null  int64  
 14  Geo_BLKGRP            0 non-null      float64
 15  Geo_CONCIT         

In [13]:
nulls_df = pd.DataFrame(li_df.isnull().sum().sort_values(ascending=False)/ li_df.shape[0] * 100, columns=['Nulls%'])
cols_removal = nulls_df[nulls_df['Nulls%']==100].index.to_list()
li_df = li_df.drop(columns=cols_removal)

In [14]:
li_df.head(2)

Unnamed: 0,Geo_FILEID,Geo_STUSAB,Geo_SUMLEV,Geo_GEOCOMP,Geo_LOGRECNO,Geo_STATE,Geo_COUNTY,Geo_TRACT,Geo_GEOID,Geo_NAME,Geo_QName,Geo_FIPS,Geo_AREALAND,Geo_AREAWATR,ACS20_5yr_B19113001,ACS20_5yr_B19113001s
0,ACSSF,al,140,0,1788,1,1,20100,14000US01001020100,Census Tract 201,"Census Tract 201, Autauga County, Alabama",1001020100,9825304,28435,70699.0,6327.878788
1,ACSSF,al,140,0,1789,1,1,20200,14000US01001020200,Census Tract 202,"Census Tract 202, Autauga County, Alabama",1001020200,3320818,5669,50133.0,1558.181818


### Data Dictionary:

Variables 
-      FILEID:         File identification
-      STUSAB:         State Postal Abbreviation
-      SUMLEV:         Summary Level
-      GEOCOMP:        Geographic Component
-      LOGRECNO:       Logical Record Number
-      STATE:          State (FIPS Code)
-      COUNTY:         County of current residence
-      TRACT:          Census Tract
-      GEOID:          Geographic Identifier
-      NAME:           Area Name
-      BTTR:           Tribal Tract
-      BTBG:           Tribal Block Group
-      QName:          Qualifying Name
-      FIPS:           FIPS
-      AREALAND:       Area (Land)
-      AREAWATR:       Area (Water)
-      B19113001:      Median Family Income In The Past 12 Months (In 2020 Inflation-Adjusted Dollars)
-      B19113001s:     Std. Error: Median Family Income In The Past 12 Months (In 2020 Inflation-Adjusted Dollars)

In [15]:
li_df.columns

Index(['Geo_FILEID', 'Geo_STUSAB', 'Geo_SUMLEV', 'Geo_GEOCOMP', 'Geo_LOGRECNO',
       'Geo_STATE', 'Geo_COUNTY', 'Geo_TRACT', 'Geo_GEOID', 'Geo_NAME',
       'Geo_QName', 'Geo_FIPS', 'Geo_AREALAND', 'Geo_AREAWATR',
       'ACS20_5yr_B19113001', 'ACS20_5yr_B19113001s'],
      dtype='object')

In [16]:
li_cols_to_keep = ['Geo_STUSAB','Geo_GEOID','Geo_QName', 'Geo_FIPS', 'Geo_AREALAND', 'Geo_AREAWATR','ACS20_5yr_B19113001']
li_df = li_df[li_cols_to_keep].copy()  # explicit copy so the edits below do not raise SettingWithCopyWarning
li_df.columns = ['state','geo_id','tract_name','tractId','area_land','area_water','median_income']
li_df['state'] = li_df['state'].str.upper()
li_df.to_csv(wd+ 'income/income_clean.csv', index=False)
li_df.head()

Unnamed: 0,state,geo_id,tract_name,tractId,area_land,area_water,median_income
0,AL,14000US01001020100,"Census Tract 201, Autauga County, Alabama",1001020100,9825304,28435,70699.0
1,AL,14000US01001020200,"Census Tract 202, Autauga County, Alabama",1001020200,3320818,5669,50133.0
2,AL,14000US01001020300,"Census Tract 203, Autauga County, Alabama",1001020300,5349271,9054,70111.0
3,AL,14000US01001020400,"Census Tract 204, Autauga County, Alabama",1001020400,6384282,8408,75580.0
4,AL,14000US01001020501,"Census Tract 205.01, Autauga County, Alabama",1001020501,6203654,0,90879.0


In [6]:
li_df = pd.read_csv(wd+ 'income/income_clean.csv')
li_df.shape

(85395, 7)

#### 2.2 Income State Level

In [23]:
### Income state wide
income_state_df = pd.read_csv(wd + 'income/R13411548_SL040.csv')
income_state_df.head()

Unnamed: 0,Geo_FILEID,Geo_STUSAB,Geo_SUMLEV,Geo_GEOCOMP,Geo_LOGRECNO,Geo_US,Geo_REGION,Geo_DIVISION,Geo_STATECE,Geo_STATE,...,Geo_AREAWATR,Geo_PLACESE,Geo_UACP,Geo_VTD,Geo_ZCTA3,Geo_TAZ,Geo_UGA,Geo_PUMA1,ACS20_5yr_B19113001,ACS20_5yr_B19113001s
0,ACSSF,al,40,0,1,,,,,1,...,4591915034,,,,,,,,66772,298.181818
1,ACSSF,ak,40,0,1,,,,,2,...,245380162784,,,,,,,,92648,805.454545
2,ACSSF,az,40,0,1,,,,,4,...,858853288,,,,,,,,73456,236.969697
3,ACSSF,ar,40,0,1,,,,,5,...,3121867339,,,,,,,,62067,328.484848
4,ACSSF,ca,40,0,1,,,,,6,...,20294133830,,,,,,,,89798,227.878788


In [24]:
nulls_df = pd.DataFrame(income_state_df.isnull().sum().sort_values(ascending=False)/ income_state_df.shape[0] * 100, columns=['Nulls%'])
cols_removal = nulls_df[nulls_df['Nulls%']==100].index.to_list()
income_state_df = income_state_df.drop(columns=cols_removal)

In [25]:
income_state_df.head(3)

Unnamed: 0,Geo_FILEID,Geo_STUSAB,Geo_SUMLEV,Geo_GEOCOMP,Geo_LOGRECNO,Geo_STATE,Geo_GEOID,Geo_NAME,Geo_QName,Geo_FIPS,Geo_AREALAND,Geo_AREAWATR,ACS20_5yr_B19113001,ACS20_5yr_B19113001s
0,ACSSF,al,40,0,1,1,04000US01,Alabama,Alabama,1,131175460655,4591915034,66772,298.181818
1,ACSSF,ak,40,0,1,2,04000US02,Alaska,Alaska,2,1478941109938,245380162784,92648,805.454545
2,ACSSF,az,40,0,1,4,04000US04,Arizona,Arizona,4,294360991275,858853288,73456,236.969697


### Data Dictionary
 
Variables 
-      FILEID:         File identification
-      STUSAB:         State Postal Abbreviation
-      SUMLEV:         Summary Level
-      GEOCOMP:        Geographic Component
-      LOGRECNO:       Logical Record Number
-      STATE:          State (FIPS Code)
-      GEOID:          Geographic Identifier
-      NAME:           Area Name
-      QName:          Qualifying Name
-      FIPS:           FIPS
-      AREALAND:       Area (Land)
-      AREAWATR:       Area (Water)
-      B19113001:      Median Family Income In The Past 12 Months (In 2020 Inflation-Adjusted Dollars)
-      B19113001s:     Std. Error: Median Family Income In The Past 12 Months (In 2020 Inflation-Adjusted Dollars)

In [28]:
cols_to_keep_stw = ['Geo_STUSAB','Geo_GEOID','Geo_QName','Geo_AREALAND', 'Geo_AREAWATR','ACS20_5yr_B19113001']
income_state_df = income_state_df[cols_to_keep_stw].copy()  # explicit copy so the edits below do not raise SettingWithCopyWarning
income_state_df.columns = ['state','geo_id','st_name','area_land','area_water','st_med_income']
income_state_df['state'] = income_state_df['state'].str.upper()
income_state_df.to_csv(wd+ 'income/income_state_clean.csv', index=False)
income_state_df.head()

Unnamed: 0,state,geo_id,st_name,area_land,area_water,st_med_income
0,AL,04000US01,Alabama,131175460655,4591915034,66772
1,AK,04000US02,Alaska,1478941109938,245380162784,92648
2,AZ,04000US04,Arizona,294360991275,858853288,73456
3,AR,04000US05,Arkansas,134660850501,3121867339,62067
4,CA,04000US06,California,403671196038,20294133830,89798


In [7]:
income_state_df = pd.read_csv(wd+ 'income/income_state_clean.csv')
income_state_df.shape

(52, 6)

#### 2.3 Merge the statewide income with the tract level income

In [49]:
income_merged = pd.merge(li_df, income_state_df[['state','st_med_income']], on=['state'], how='left')
income_merged.to_csv(wd+ 'income/income_tract_st_merged.csv', index=False)
income_merged.head()

Unnamed: 0,state,geo_id,tract_name,tractId,area_land,area_water,median_income,st_med_income
0,AL,14000US01001020100,"Census Tract 201, Autauga County, Alabama",1001020100,9825304,28435,70699.0,66772
1,AL,14000US01001020200,"Census Tract 202, Autauga County, Alabama",1001020200,3320818,5669,50133.0,66772
2,AL,14000US01001020300,"Census Tract 203, Autauga County, Alabama",1001020300,5349271,9054,70111.0,66772
3,AL,14000US01001020400,"Census Tract 204, Autauga County, Alabama",1001020400,6384282,8408,75580.0,66772
4,AL,14000US01001020501,"Census Tract 205.01, Autauga County, Alabama",1001020501,6203654,0,90879.0,66772


#### 2.4 Income MSA level data

In [41]:
income_msa_df = pd.read_csv(wd + 'income/R13411591_SL320.csv')
income_msa_df.head(3)

Unnamed: 0,Geo_FILEID,Geo_STUSAB,Geo_SUMLEV,Geo_GEOCOMP,Geo_LOGRECNO,Geo_US,Geo_REGION,Geo_DIVISION,Geo_STATECE,Geo_STATE,...,Geo_AREAWATR,Geo_PLACESE,Geo_UACP,Geo_VTD,Geo_ZCTA3,Geo_TAZ,Geo_UGA,Geo_PUMA1,ACS20_5yr_B19113001,ACS20_5yr_B19113001s
0,ACSSF,al,320,0,8583,,,,,1,...,148503356.0,,,,,,,,61200,1083.636364
1,ACSSF,al,320,0,8584,,,,,1,...,168755442.0,,,,,,,,55739,1167.272727
2,ACSSF,al,320,0,8585,,,,,1,...,16624267.0,,,,,,,,59609,1710.909091


In [42]:
nulls_df = pd.DataFrame(income_msa_df.isnull().sum().sort_values(ascending=False)/ income_msa_df.shape[0] * 100, columns=['Nulls%'])
cols_removal = nulls_df[nulls_df['Nulls%']==100].index.to_list()
income_msa_df = income_msa_df.drop(columns=cols_removal)

In [43]:
income_msa_df.head(3)

Unnamed: 0,Geo_FILEID,Geo_STUSAB,Geo_SUMLEV,Geo_GEOCOMP,Geo_LOGRECNO,Geo_STATE,Geo_CBSA,Geo_GEOID,Geo_NAME,Geo_QName,Geo_FIPS,Geo_AREALAND,Geo_AREAWATR,ACS20_5yr_B19113001,ACS20_5yr_B19113001s
0,ACSSF,al,320,0,8583,1,10700,32000US0110700,Albertville,"Albertville, AL Micro Area; Alabama",110700,1465523000.0,148503356.0,61200,1083.636364
1,ACSSF,al,320,0,8584,1,10760,32000US0110760,Alexander City,"Alexander City, AL Micro Area; Alabama",110760,3541671000.0,168755442.0,55739,1167.272727
2,ACSSF,al,320,0,8585,1,11500,32000US0111500,Anniston-Oxford,"Anniston-Oxford, AL Metro Area; Alabama",111500,1569190000.0,16624267.0,59609,1710.909091


#### Data Dictionary
 
Variables 
-      FILEID:         File identification
-      STUSAB:         State Postal Abbreviation
-      SUMLEV:         Summary Level
-      GEOCOMP:        Geographic Component
-      LOGRECNO:       Logical Record Number
-      STATE:          State (FIPS Code)
-      CBSA:           Metropolitan and Micropolitan Statistical Area
-      GEOID:          Geographic Identifier
-      NAME:           Area Name
-      QName:          Qualifying Name
-      FIPS:           FIPS
-      AREALAND:       Area (Land)
-      AREAWATR:       Area (Water)
-      B19113001:      Median Family Income In The Past 12 Months (In 2020 Inflation-Adjusted Dollars)
-      B19113001s:     Std. Error: Median Family Income In The Past 12 Months (In 2020 Inflation-Adjusted Dollars)

In [44]:
income_msa_df.columns

Index(['Geo_FILEID', 'Geo_STUSAB', 'Geo_SUMLEV', 'Geo_GEOCOMP', 'Geo_LOGRECNO',
       'Geo_STATE', 'Geo_CBSA', 'Geo_GEOID', 'Geo_NAME', 'Geo_QName',
       'Geo_FIPS', 'Geo_AREALAND', 'Geo_AREAWATR', 'ACS20_5yr_B19113001',
       'ACS20_5yr_B19113001s'],
      dtype='object')

In [45]:
msa_cols_to_keep = ['Geo_STUSAB','Geo_CBSA','Geo_GEOID','Geo_QName','Geo_FIPS','Geo_AREALAND',  'Geo_AREAWATR','ACS20_5yr_B19113001']
income_msa_df = income_msa_df[msa_cols_to_keep].copy()  # explicit copy so the edits below do not raise SettingWithCopyWarning
income_msa_df.columns = ['state','cbsa_id','geo_id','msa_name', 'msa_tract_id','area_land','area_water','msa_med_income']
income_msa_df['state'] = income_msa_df['state'].str.upper()
income_msa_df.to_csv(wd+ 'income/income_msa_clean.csv', index=False)
income_msa_df.head()

Unnamed: 0,state,cbsa_id,geo_id,msa_name,msa_tract_id,area_land,area_water,msa_med_income
0,AL,10700,32000US0110700,"Albertville, AL Micro Area; Alabama",110700,1465523000.0,148503356.0,61200
1,AL,10760,32000US0110760,"Alexander City, AL Micro Area; Alabama",110760,3541671000.0,168755442.0,55739
2,AL,11500,32000US0111500,"Anniston-Oxford, AL Metro Area; Alabama",111500,1569190000.0,16624267.0,59609
3,AL,12120,32000US0112120,"Atmore, AL Micro Area; Alabama",112120,,,47336
4,AL,12220,32000US0112220,"Auburn-Opelika, AL Metro Area; Alabama",112220,1573511000.0,21530362.0,75091


In [9]:
income_msa_df = pd.read_csv(wd+ 'income/income_msa_clean.csv')
income_msa_df.shape

(1011, 8)

In [47]:
income_merged.head(3)

Unnamed: 0,state,geo_id,tract_name,tractId,area_land,area_water,median_income,st_med_income
0,AL,14000US01001020100,"Census Tract 201, Autauga County, Alabama",1001020100,9825304,28435,70699.0,66772
1,AL,14000US01001020200,"Census Tract 202, Autauga County, Alabama",1001020200,3320818,5669,50133.0,66772
2,AL,14000US01001020300,"Census Tract 203, Autauga County, Alabama",1001020300,5349271,9054,70111.0,66772


In [38]:
poverty_df.head(3)

Unnamed: 0,state,geo_id,tract_name,tractId,area_land,area_water,total_pop,poverty_pop,poverty_percent
7,AL,14000US01001020600,"Census Tract 206, Autauga County, Alabama",1001020600,8041611,59779,3523,831,23.587851
8,AL,14000US01001020700,"Census Tract 207, Autauga County, Alabama",1001020700,22411848,772012,3528,1058,29.988662
10,AL,14000US01001020803,"Census Tract 208.03, Autauga County, Alabama",1001020803,129030127,584928,4350,902,20.735632


#### 2.5 Shapefile for the MSAs

In [14]:
msa_shp_df = gpd.read_file(wd+ "income/2021_msa_shp/tl_2021_us_cbsa.shp")
msa_shp_df.head(3)

Unnamed: 0,CSAFP,CBSAFP,GEOID,NAME,NAMELSAD,LSAD,MEMI,MTFCC,ALAND,AWATER,INTPTLAT,INTPTLON,geometry
0,122,12020,12020,"Athens-Clarke County, GA","Athens-Clarke County, GA Metro Area",M1,1,G3110,2654607902,26109459,33.943984,-83.2138965,"POLYGON ((-83.36003 34.04057, -83.36757 34.043..."
1,122,12060,12060,"Atlanta-Sandy Springs-Alpharetta, GA","Atlanta-Sandy Springs-Alpharetta, GA Metro Area",M1,1,G3110,22495873026,386782308,33.693728,-84.3999113,"POLYGON ((-84.27014 32.99101, -84.27084 32.991..."
2,428,12100,12100,"Atlantic City-Hammonton, NJ","Atlantic City-Hammonton, NJ Metro Area",M1,1,G3110,1438775279,301270067,39.4693555,-74.6337591,"POLYGON ((-74.58640 39.30989, -74.58665 39.309..."


In [None]:
# Check that CBSAFP and GEOID carry the same information, so that one of them can be dropped
(msa_shp_df['CBSAFP'] == msa_shp_df['GEOID']).all()

In [15]:
# CSAFP and NAME can also be dropped
# Data dictionary doc: https://www2.census.gov/geo/pdfs/maps-data/data/tiger/tgrshp2021/TGRSHP2021_TechDoc.pdf
msa_shp_df = msa_shp_df[['GEOID','NAMELSAD','LSAD','ALAND','AWATER','INTPTLAT','INTPTLON','geometry']].copy()
msa_shp_df.columns = ['cbsa_id','msa_name','lsad','area_land','area_water','lat','lon','geometry']
msa_shp_df['cbsa_id'] = msa_shp_df['cbsa_id'].astype(int)
# Create the output folder if it does not exist yet
os.makedirs(wd+ "income/msa_clean_shp", exist_ok=True)
msa_shp_df.to_file(wd+ "income/msa_clean_shp/msa_clean.shp")
msa_shp_df.head(3)

Unnamed: 0,cbsa_id,msa_name,lsad,area_land,area_water,lat,lon,geometry
0,12020,"Athens-Clarke County, GA Metro Area",M1,2654607902,26109459,33.943984,-83.2138965,"POLYGON ((-83.36003 34.04057, -83.36757 34.043..."
1,12060,"Atlanta-Sandy Springs-Alpharetta, GA Metro Area",M1,22495873026,386782308,33.693728,-84.3999113,"POLYGON ((-84.27014 32.99101, -84.27084 32.991..."
2,12100,"Atlantic City-Hammonton, NJ Metro Area",M1,1438775279,301270067,39.4693555,-74.6337591,"POLYGON ((-74.58640 39.30989, -74.58665 39.309..."


- At this point we have the MSA shapefile, the median income data (state and tract level) and the poverty data (tract level).
- To join the MSA shapefile with the other datasets, the geometries of the tracts are needed.
- Download all the tract shapefiles from the TIGER/Line database on the Census website.

In [16]:
income_msa_df.head(3)

Unnamed: 0,state,cbsa_id,geo_id,msa_name,msa_tract_id,area_land,area_water,msa_med_income
0,AL,10700,32000US0110700,"Albertville, AL Micro Area; Alabama",110700,1465523000.0,148503356.0,61200
1,AL,10760,32000US0110760,"Alexander City, AL Micro Area; Alabama",110760,3541671000.0,168755442.0,55739
2,AL,11500,32000US0111500,"Anniston-Oxford, AL Metro Area; Alabama",111500,1569190000.0,16624267.0,59609


In [48]:
msa_income_shp = pd.merge(income_msa_df, msa_shp_df[['cbsa_id','lsad','lat','lon','geometry']], on=['cbsa_id'], how='left')
#save the file
if not os.path.exists(wd+ "income/msa_income_shp/"):
    os.makedirs(wd+ "income/msa_income_shp")
msa_income_shp = gpd.GeoDataFrame(msa_income_shp, geometry='geometry')
msa_income_shp.to_file(wd+ "income/msa_income_shp/msa_income_shp.shp")
msa_income_shp.head(3)


Unnamed: 0,state,cbsa_id,geo_id,msa_name,msa_tract_id,area_land,area_water,msa_med_income,lsad,lat,lon,geometry
0,AL,10700,32000US0110700,"Albertville, AL Micro Area; Alabama",110700,1465523000.0,148503356.0,61200,M2,34.3095637,-86.3216681,"POLYGON ((-86.14981 34.53363, -86.14864 34.533..."
1,AL,10760,32000US0110760,"Alexander City, AL Micro Area; Alabama",110760,3541671000.0,168755442.0,55739,M2,32.9004862,-86.002932,"POLYGON ((-86.00917 33.09026, -86.00717 33.090..."
2,AL,11500,32000US0111500,"Anniston-Oxford, AL Metro Area; Alabama",111500,1569190000.0,16624267.0,59609,M1,33.7705162,-85.8279089,"POLYGON ((-85.63688 33.84649, -85.62875 33.846..."
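
The merged table contains both metropolitan and micropolitan areas: in the rows above, `lsad` is `M1` for a "Metro Area" and `M2` for a "Micro Area". Since the income criterion refers to metropolitan areas, one possible reading is that only the `M1` rows should supply `msa_med_income`. The sketch below illustrates that filter on plain dictionaries; it is an assumption about how the criterion could be applied, not a step taken from this notebook, and `metro_only` is a hypothetical helper.

```python
def metro_only(rows):
    """Keep rows whose LSAD code marks a metropolitan (not micropolitan) area."""
    return [r for r in rows if r["lsad"] == "M1"]


# Rows copied from the output above, reduced to the relevant columns.
rows = [
    {"cbsa_id": 10700, "msa_name": "Albertville, AL Micro Area; Alabama",
     "lsad": "M2", "msa_med_income": 61200},
    {"cbsa_id": 10760, "msa_name": "Alexander City, AL Micro Area; Alabama",
     "lsad": "M2", "msa_med_income": 55739},
    {"cbsa_id": 11500, "msa_name": "Anniston-Oxford, AL Metro Area; Alabama",
     "lsad": "M1", "msa_med_income": 59609},
]
metro_rows = metro_only(rows)
```

On the GeoDataFrame itself the equivalent would be a boolean mask such as `msa_income_shp[msa_income_shp['lsad'] == 'M1']`.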


#### 3 Download and merge the tract data for all the census tracts

In [25]:
base_url = 'https://www2.census.gov/geo/tiger/TIGER2021/TRACT/'

# Fetch the directory listing and collect every link on the page
response = requests.get(base_url)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
links = soup.find_all('a')

# Create the download folder once, before the loop
os.makedirs(wd+ "all_tracts/", exist_ok=True)

for link in links:
    href = link.get('href')
    if href is not None and href.endswith('.zip'):
        download_url = urllib.parse.urljoin(base_url, href)
        file_name = href.split('/')[-1]
        print(f'Downloading {file_name}...')
        file_response = requests.get(download_url)
        file_response.raise_for_status()  # stop on a failed download instead of saving an error page
        save_path = wd+ "all_tracts/" + file_name
        with open(save_path, 'wb') as file:
            file.write(file_response.content)
        print(f'{file_name} downloaded successfully.')

Downloading tl_2021_01_tract.zip...
tl_2021_01_tract.zip downloaded successfully.
Downloading tl_2021_02_tract.zip...
tl_2021_02_tract.zip downloaded successfully.
Downloading tl_2021_04_tract.zip...
tl_2021_04_tract.zip downloaded successfully.
Downloading tl_2021_05_tract.zip...
tl_2021_05_tract.zip downloaded successfully.
Downloading tl_2021_06_tract.zip...
tl_2021_06_tract.zip downloaded successfully.
Downloading tl_2021_08_tract.zip...
tl_2021_08_tract.zip downloaded successfully.
Downloading tl_2021_09_tract.zip...
tl_2021_09_tract.zip downloaded successfully.
Downloading tl_2021_10_tract.zip...
tl_2021_10_tract.zip downloaded successfully.
Downloading tl_2021_11_tract.zip...
tl_2021_11_tract.zip downloaded successfully.
Downloading tl_2021_12_tract.zip...
tl_2021_12_tract.zip downloaded successfully.
Downloading tl_2021_13_tract.zip...
tl_2021_13_tract.zip downloaded successfully.
Downloading tl_2021_15_tract.zip...
tl_2021_15_tract.zip downloaded successfully.
Downloading tl_2
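
The link-extraction part of the download cell can be checked without any network access. The sketch below does the same job as the `BeautifulSoup` lookup but with the standard library's `html.parser`, which is a swapped-in technique for illustration only; `ZipLinkParser`, `zip_urls` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ZipLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points to a .zip file."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.endswith(".zip"):
            self.hrefs.append(href)


def zip_urls(index_html: str, base_url: str) -> list:
    """Return absolute URLs for all .zip links found in a directory listing."""
    parser = ZipLinkParser()
    parser.feed(index_html)
    return [urljoin(base_url, h) for h in parser.hrefs]


# Made-up fragment imitating a directory listing page.
sample_html = (
    '<a href="../">Parent Directory</a>'
    '<a href="tl_2021_01_tract.zip">tl_2021_01_tract.zip</a>'
    '<a href="tl_2021_02_tract.zip">tl_2021_02_tract.zip</a>'
)
urls = zip_urls(sample_html, "https://www2.census.gov/geo/tiger/TIGER2021/TRACT/")
```

Separating "find the URLs" from "download the files" makes it easy to inspect the list of files before starting a long download.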

In [35]:
folder_path = wd+ 'all_tracts/'

zip_files = [file for file in os.listdir(folder_path) if file.endswith('.zip')]

#List to store GeoDataFrames
gdfs = []

#Iterate over each zip file and read the .shp file
for zip_file in zip_files:
    with zipfile.ZipFile(os.path.join(folder_path, zip_file), 'r') as zf:
        shp_file = [file for file in zf.namelist() if file.endswith('.shp')][0]
        gdf = gpd.read_file(f'zip://{os.path.join(folder_path, zip_file)}!{shp_file}')
        gdfs.append(gdf)

#Check if the list has all the tracts in it
print(len(gdfs))

56


In [39]:
#check if all the dataframes inside the gdfs list have same columns
for i in range(len(gdfs)):
    if i == 0:
        continue
    else:  
        assert gdfs[i].columns.to_list() == gdfs[i-1].columns.to_list()


In [44]:
#concat all the dataframes inside the gdfs list to a single geo dataframe
merged_tracts = pd.concat(gdfs, ignore_index=True)
print(merged_tracts.shape)
merged_tracts.head(3)

(85528, 13)


Unnamed: 0,STATEFP,COUNTYFP,TRACTCE,GEOID,NAME,NAMELSAD,MTFCC,FUNCSTAT,ALAND,AWATER,INTPTLAT,INTPTLON,geometry
0,1,79,979201,1079979201,9792.01,Census Tract 9792.01,G5020,S,173543715,33343864,34.7177132,-87.3401349,"POLYGON ((-87.43611 34.72743, -87.43610 34.727..."
1,1,79,979202,1079979202,9792.02,Census Tract 9792.02,G5020,S,132640589,788347,34.6396492,-87.3633477,"POLYGON ((-87.45696 34.61352, -87.45693 34.613..."
2,1,79,979502,1079979502,9795.02,Census Tract 9795.02,G5020,S,75015361,645300,34.556996,-87.2404514,"POLYGON ((-87.29614 34.54337, -87.29610 34.543..."


In [43]:
len(merged_tracts['STATEFP'].unique())

56
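
The tract shapefiles cover 56 state-level codes, while the cleaned state income table has 52 rows; the difference is likely due to TIGER/Line also publishing files for island territories that the ACS tables here do not include. A quick way to see which codes are present is to read them from the archive names. The sketch below assumes the `tl_<year>_<state FIPS>_tract.zip` naming seen in the download log; the helper name and the `expected` set are made up for the example.

```python
import re

# File names look like tl_2021_01_tract.zip: year, then two-digit state FIPS.
TRACT_ZIP = re.compile(r"^tl_(\d{4})_(\d{2})_tract\.zip$")


def state_fips_from_names(file_names):
    """Return the set of two-digit state FIPS codes found in the file names."""
    codes = set()
    for name in file_names:
        match = TRACT_ZIP.match(name)
        if match:
            codes.add(match.group(2))
    return codes


names = ["tl_2021_01_tract.zip", "tl_2021_02_tract.zip",
         "tl_2021_04_tract.zip", "readme.txt"]
found = state_fips_from_names(names)
expected = {"01", "02", "04", "05"}  # illustrative subset, not the full list
missing = sorted(expected - found)
```

Running the same comparison against the state codes in the income data would show which shapefile states have no matching ACS rows, and the reverse.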

In [61]:
# Clean the merged_tracts dataframe: keep the needed columns and cast the keys to integers
merged_tracts = merged_tracts[['STATEFP','GEOID','NAMELSAD','ALAND','AWATER','INTPTLAT','INTPTLON','geometry']].copy()
merged_tracts.columns = ['stateFP','tract_id','tract_name','area_land','area_water','lat','lon','geometry']
merged_tracts['tract_id'] = merged_tracts['tract_id'].astype('int64')
merged_tracts['stateFP'] = merged_tracts['stateFP'].astype(int)
merged_tracts.head(3)

Unnamed: 0,stateFP,tract_id,tract_name,area_land,area_water,lat,lon,geometry
0,1,1079979201,Census Tract 9792.01,173543715,33343864,34.7177132,-87.3401349,"POLYGON ((-87.43611 34.72743, -87.43610 34.727..."
1,1,1079979202,Census Tract 9792.02,132640589,788347,34.6396492,-87.3633477,"POLYGON ((-87.45696 34.61352, -87.45693 34.613..."
2,1,1079979502,Census Tract 9795.02,75015361,645300,34.556996,-87.2404514,"POLYGON ((-87.29614 34.54337, -87.29610 34.543..."


In [62]:
# Save the merged shapefile
if not os.path.exists(wd+ "merged_tracts"):
    os.makedirs(wd+ "merged_tracts")
output_path = wd + 'merged_tracts/merged_tracts.shp'
merged_tracts.to_file(output_path)

#### 4 Merge all the datasets together

In [96]:
# All datasets are now cleaned and ready to be merged for classifying the tracts as low income or not
# Re-read the data in case the notebook was restarted
poverty_df = pd.read_csv(wd+ 'poverty/poverty_clean.csv')
income_merged = pd.read_csv(wd+ 'income/income_tract_st_merged.csv')
msa_income_shp = gpd.read_file(wd+ "income/msa_income_shp/msa_income_shp.shp")
merged_tracts = gpd.read_file(wd + "merged_tracts/merged_tracts.shp")

In [63]:
merged_tracts.head(3)

Unnamed: 0,stateFP,tract_id,tract_name,area_land,area_water,lat,lon,geometry
0,1,1079979201,Census Tract 9792.01,173543715,33343864,34.7177132,-87.3401349,"POLYGON ((-87.43611 34.72743, -87.43610 34.727..."
1,1,1079979202,Census Tract 9792.02,132640589,788347,34.6396492,-87.3633477,"POLYGON ((-87.45696 34.61352, -87.45693 34.613..."
2,1,1079979502,Census Tract 9795.02,75015361,645300,34.556996,-87.2404514,"POLYGON ((-87.29614 34.54337, -87.29610 34.543..."


In [64]:
income_merged.head(3)

Unnamed: 0,state,geo_id,tract_name,tractId,area_land,area_water,median_income,st_med_income
0,AL,14000US01001020100,"Census Tract 201, Autauga County, Alabama",1001020100,9825304,28435,70699.0,66772
1,AL,14000US01001020200,"Census Tract 202, Autauga County, Alabama",1001020200,3320818,5669,50133.0,66772
2,AL,14000US01001020300,"Census Tract 203, Autauga County, Alabama",1001020300,5349271,9054,70111.0,66772


##### 4.1 Merging Income tract level data with the tracts Shapefile

In [113]:
#Merging the income_merged dataframe with the tracts shapefile
income_shp_merged = pd.merge(income_merged[['state','geo_id','tract_name','tractId','median_income','st_med_income']], 
                             merged_tracts[['stateFP','tract_id','tract_name','area_land','area_water','lat','lon','geometry']], 
                             left_on=['tractId'], right_on =['tract_id'], how='left')

In [114]:
income_shp_merged.head(3)

Unnamed: 0,state,geo_id,tract_name_x,tractId,median_income,st_med_income,stateFP,tract_id,tract_name_y,area_land,area_water,lat,lon,geometry
0,AL,14000US01001020100,"Census Tract 201, Autauga County, Alabama",1001020100,70699.0,66772,1,1001020100,Census Tract 201,9825304,28435,32.4819731,-86.4915648,"POLYGON ((-86.51038 32.47225, -86.51031 32.472..."
1,AL,14000US01001020200,"Census Tract 202, Autauga County, Alabama",1001020200,50133.0,66772,1,1001020200,Census Tract 202,3320818,5669,32.475758,-86.4724678,"POLYGON ((-86.48127 32.47744, -86.48126 32.477..."
2,AL,14000US01001020300,"Census Tract 203, Autauga County, Alabama",1001020300,70111.0,66772,1,1001020300,Census Tract 203,5349271,9054,32.4740243,-86.4597033,"POLYGON ((-86.47087 32.47573, -86.47084 32.475..."
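Because the merge in 4.1 is a left join on `tractId`, an income row with no matching tract in the shapefile is kept but ends up with a missing geometry. A minimal sketch of how this could be checked, using small stand-in frames (the toy frames and the `_merge` check are illustrative and not part of the original notebook):

```python
import pandas as pd

# Toy stand-ins for income_merged and merged_tracts (a few rows for illustration)
income_toy = pd.DataFrame({"tractId": [1001020100, 1001020200, 1001020300],
                           "median_income": [70699.0, 50133.0, 70111.0]})
tracts_toy = pd.DataFrame({"tract_id": [1001020100, 1001020200],
                           "geometry": ["POLYGON A", "POLYGON B"]})

# indicator=True adds a _merge column; validate raises if tract_id repeats on the right side
checked = pd.merge(income_toy, tracts_toy, left_on="tractId", right_on="tract_id",
                   how="left", indicator=True, validate="many_to_one")

# Rows flagged 'left_only' found no matching tract geometry
unmatched = checked.loc[checked["_merge"] == "left_only", "tractId"].tolist()
print(unmatched)
```

An empty list would mean every income row found its geometry.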


In [115]:
poverty_df.head(3)

Unnamed: 0,state,geo_id,tract_name,tractId,area_land,area_water,total_pop,poverty_pop,poverty_percent
0,AL,14000US01001020100,"Census Tract 201, Autauga County, Alabama",1001020100,9825304,28435,1941,265,13.652756
1,AL,14000US01001020200,"Census Tract 202, Autauga County, Alabama",1001020200,3320818,5669,1511,257,17.008604
2,AL,14000US01001020300,"Census Tract 203, Autauga County, Alabama",1001020300,5349271,9054,3694,533,14.428803


##### 4.2 Merging the dataset from 4.1 with the poverty dataframe

In [116]:
# Merge the poverty columns onto the income + tract geometry dataframe from 4.1
inc_pov_merged = pd.merge(income_shp_merged, poverty_df[['tractId','total_pop','poverty_pop','poverty_percent']], on = ['tractId'], how='left')
inc_pov_merged.head(3)

Unnamed: 0,state,geo_id,tract_name_x,tractId,median_income,st_med_income,stateFP,tract_id,tract_name_y,area_land,area_water,lat,lon,geometry,total_pop,poverty_pop,poverty_percent
0,AL,14000US01001020100,"Census Tract 201, Autauga County, Alabama",1001020100,70699.0,66772,1,1001020100,Census Tract 201,9825304,28435,32.4819731,-86.4915648,"POLYGON ((-86.51038 32.47225, -86.51031 32.472...",1941,265,13.652756
1,AL,14000US01001020200,"Census Tract 202, Autauga County, Alabama",1001020200,50133.0,66772,1,1001020200,Census Tract 202,3320818,5669,32.475758,-86.4724678,"POLYGON ((-86.48127 32.47744, -86.48126 32.477...",1511,257,17.008604
2,AL,14000US01001020300,"Census Tract 203, Autauga County, Alabama",1001020300,70111.0,66772,1,1001020300,Census Tract 203,5349271,9054,32.4740243,-86.4597033,"POLYGON ((-86.47087 32.47573, -86.47084 32.475...",3694,533,14.428803


##### 4.3 Cleaning up the appearance of the dataframe

In [117]:
inc_pov_merged.drop(columns=['tract_id', 'tract_name_y'], inplace=True)
# Reorder the columns
inc_pov_merged = inc_pov_merged[['stateFP','state','geo_id','tractId','tract_name_x','median_income','st_med_income','total_pop','poverty_pop',
                                 'poverty_percent','area_land','area_water','lat','lon','geometry']]
# Rename the columns to at most 10 characters, the field-name limit of the shapefile format
inc_pov_merged.columns = ['stateFP','state','geo_id','tractId','tract_name','med_inc','st_med_inc','total_pop','pov_pop',
                          'pov_perc','area_land','area_water','lat','lon','geometry']
inc_pov_merged.head(3)

Unnamed: 0,stateFP,state,geo_id,tractId,tract_name,med_inc,st_med_inc,total_pop,pov_pop,pov_perc,area_land,area_water,lat,lon,geometry
0,1,AL,14000US01001020100,1001020100,"Census Tract 201, Autauga County, Alabama",70699.0,66772,1941,265,13.652756,9825304,28435,32.4819731,-86.4915648,"POLYGON ((-86.51038 32.47225, -86.51031 32.472..."
1,1,AL,14000US01001020200,1001020200,"Census Tract 202, Autauga County, Alabama",50133.0,66772,1511,257,17.008604,3320818,5669,32.475758,-86.4724678,"POLYGON ((-86.48127 32.47744, -86.48126 32.477..."
2,1,AL,14000US01001020300,1001020300,"Census Tract 203, Autauga County, Alabama",70111.0,66772,3694,533,14.428803,5349271,9054,32.4740243,-86.4597033,"POLYGON ((-86.47087 32.47573, -86.47084 32.475..."
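Shapefile attributes are stored in a dBASE table whose field names are limited to 10 characters, and longer names are truncated on write (which is how `msa_med_income` appears as `msa_med_in` in the MSA shapefile). A small sketch of a check on the names before and after the rename; the helper `too_long` is hypothetical:

```python
# Maximum field-name length in a shapefile attribute (.dbf) table
MAX_FIELD_LEN = 10

# Attribute columns before and after the rename in 4.3 (geometry is stored separately)
old_names = ['stateFP', 'state', 'geo_id', 'tractId', 'tract_name_x', 'median_income',
             'st_med_income', 'total_pop', 'poverty_pop', 'poverty_percent',
             'area_land', 'area_water', 'lat', 'lon']
new_names = ['stateFP', 'state', 'geo_id', 'tractId', 'tract_name', 'med_inc',
             'st_med_inc', 'total_pop', 'pov_pop', 'pov_perc',
             'area_land', 'area_water', 'lat', 'lon']

def too_long(names, limit=MAX_FIELD_LEN):
    """Return the names that exceed the field-name limit."""
    return [name for name in names if len(name) > limit]

print(too_long(old_names))  # names that would be truncated on write
print(too_long(new_names))  # empty once every name fits
```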


##### 4.4 Merging the MSA shape file onto the dataframe from 4.3

In [118]:
#Convert the dataframe to a geodataframe
inc_pov_merged = gpd.GeoDataFrame(inc_pov_merged, geometry='geometry')
msa_income_shp = gpd.GeoDataFrame(msa_income_shp, geometry='geometry')

In [119]:
msa_income_shp.head(3)

Unnamed: 0,state,cbsa_id,geo_id,msa_name,msa_tract_,area_land,area_water,msa_med_in,lsad,lat,lon,geometry
0,AL,10700,32000US0110700,"Albertville, AL Micro Area; Alabama",110700,1465523000.0,148503356.0,61200,M2,34.3095637,-86.3216681,"POLYGON ((-86.14981 34.53363, -86.14864 34.533..."
1,AL,10760,32000US0110760,"Alexander City, AL Micro Area; Alabama",110760,3541671000.0,168755442.0,55739,M2,32.9004862,-86.002932,"POLYGON ((-86.00917 33.09026, -86.00717 33.090..."
2,AL,11500,32000US0111500,"Anniston-Oxford, AL Metro Area; Alabama",111500,1569190000.0,16624267.0,59609,M1,33.7705162,-85.8279089,"POLYGON ((-85.63688 33.84649, -85.62875 33.846..."


In [135]:
msa_income_shp.crs == inc_pov_merged.crs

True

In [124]:
# Spatially join msa_income_shp to get the MSA name and MSA median income for each tract
msa_cols_to_keep = ['state','cbsa_id','geo_id','msa_name','msa_tract_','lsad','msa_med_in','geometry']
# .copy() so that the rename below does not act on a slice of msa_income_shp
msa_income_shp_join = msa_income_shp[msa_cols_to_keep].copy()
# Rename the columns to avoid the left and right suffixes
msa_income_shp_join.columns = ['state_msa','cbsa_id','geo_id_msa','msa_name','msa_tract','lsad','msa_med_inc','geometry']
final_merged_shp = gpd.sjoin(inc_pov_merged, msa_income_shp_join, how='left', predicate='intersects').drop(columns=['index_right'])
# Reorder the columns for readability
final_merged_shp = final_merged_shp[['stateFP','state','geo_id','tractId','tract_name','cbsa_id','geo_id_msa','msa_name','msa_tract','lsad',
                                     'med_inc','st_med_inc','msa_med_inc','total_pop','pov_pop','pov_perc','area_land','area_water','lat','lon','geometry']]
print(final_merged_shp.shape)
final_merged_shp.head(3)

Unnamed: 0,stateFP,state,geo_id,tractId,tract_name,cbsa_id,geo_id_msa,msa_name,msa_tract,lsad,med_inc,st_med_inc,msa_med_inc,total_pop,pov_pop,pov_perc,area_land,area_water,lat,lon,geometry
0,1,AL,14000US01001020100,1001020100,"Census Tract 201, Autauga County, Alabama",33860.0,32000US0133860,"Montgomery, AL Metro Area; Alabama",133860.0,M1,70699.0,66772,68115.0,1941,265,13.652756,9825304,28435,32.4819731,-86.4915648,"POLYGON ((-86.51038 32.47225, -86.51031 32.472..."
1,1,AL,14000US01001020200,1001020200,"Census Tract 202, Autauga County, Alabama",33860.0,32000US0133860,"Montgomery, AL Metro Area; Alabama",133860.0,M1,50133.0,66772,68115.0,1511,257,17.008604,3320818,5669,32.475758,-86.4724678,"POLYGON ((-86.48127 32.47744, -86.48126 32.477..."
2,1,AL,14000US01001020300,1001020300,"Census Tract 203, Autauga County, Alabama",33860.0,32000US0133860,"Montgomery, AL Metro Area; Alabama",133860.0,M1,70111.0,66772,68115.0,3694,533,14.428803,5349271,9054,32.4740243,-86.4597033,"POLYGON ((-86.47087 32.47573, -86.47084 32.475..."
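A left spatial join with `predicate='intersects'` returns one row per matching pair, so a tract whose boundary touches more than one MSA polygon would appear more than once in `final_merged_shp`. A minimal sketch of how such repeats could be counted, using a plain pandas frame with made-up rows in place of the real join result:

```python
import pandas as pd

# Toy stand-in for the join result: tract 2 matched two MSAs, tract 3 matched none
joined_toy = pd.DataFrame({
    "tractId": [1, 2, 2, 3],
    "cbsa_id": [33860.0, 33860.0, 10700.0, None],
})

# Number of join rows per tract; anything above 1 is a tract matched to several MSAs
matches_per_tract = joined_toy.groupby("tractId").size()
multi_matched = matches_per_tract[matches_per_tract > 1].index.tolist()

print(len(joined_toy), joined_toy["tractId"].nunique())
print(multi_matched)
```

If the row count exceeds the number of unique `tractId` values, the table has to be reduced to one row per tract before the tracts are classified.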


#### 5. Apply the conditions for classifying a tract as a low income community
- The poverty rate is at least 20 percent, OR
- The median family income does not exceed 80 percent of statewide median family income or, if in a metropolitan area, the greater of 80 percent statewide median family income or 80 percent of metropolitan area median family income
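As a worked example, the two criteria can be written as a small function for a single tract. This is only a sketch: the helper `is_low_income` and its parameter names are hypothetical, and the thresholds are read as inclusive ("at least", "does not exceed").

```python
def is_low_income(pov_perc, med_inc, st_med_inc, msa_med_inc=None, in_metro=False):
    """Hypothetical helper: apply the two criteria above to a single tract."""
    # Criterion 1: poverty rate of at least 20 percent
    if pov_perc >= 20:
        return True
    # Criterion 2: the income threshold is 80% of the state median or, for
    # metropolitan tracts, the greater of 80% of the state and 80% of the MSA median
    threshold = 0.8 * st_med_inc
    if in_metro and msa_med_inc is not None:
        threshold = max(threshold, 0.8 * msa_med_inc)
    return med_inc <= threshold

# The three Autauga County tracts shown above (Montgomery, AL Metro Area):
# (pov_perc, med_inc, st_med_inc, msa_med_inc)
tracts = [
    (13.652756, 70699.0, 66772, 68115.0),
    (17.008604, 50133.0, 66772, 68115.0),
    (14.428803, 70111.0, 66772, 68115.0),
]
flags = [is_low_income(p, m, s, msa, in_metro=True) for p, m, s, msa in tracts]
print(flags)  # [False, True, False]
```

For these tracts the threshold is max(0.8 × 66772, 0.8 × 68115) = 54492, so only Census Tract 202, with a median family income of 50133, qualifies.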


In [137]:
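# The intersects join can return several rows for one tract (one per matching MSA).
# Dissolve by tractId to get back to one row per tract; by default dissolve keeps the first row's attribute values.
final_merged_shp_dissolved = final_merged_shp.dissolve(by='tractId')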
final_merged_shp_dissolved.reset_index(inplace=True)
print(final_merged_shp_dissolved.shape)
final_merged_shp_dissolved.head(3)

(85395, 22)


Unnamed: 0,tractId,geometry,stateFP,state,geo_id,tract_name,cbsa_id,geo_id_msa,msa_name,msa_tract,lsad,med_inc,st_med_inc,msa_med_inc,total_pop,pov_pop,pov_perc,area_land,area_water,lat,lon,low_income
0,1001020100,"POLYGON ((-86.51038 32.47225, -86.51031 32.472...",1,AL,14000US01001020100,"Census Tract 201, Autauga County, Alabama",33860.0,32000US0133860,"Montgomery, AL Metro Area; Alabama",133860.0,M1,70699.0,66772,68115.0,1941,265,13.652756,9825304,28435,32.4819731,-86.4915648,0
1,1001020200,"POLYGON ((-86.48127 32.47744, -86.48126 32.477...",1,AL,14000US01001020200,"Census Tract 202, Autauga County, Alabama",33860.0,32000US0133860,"Montgomery, AL Metro Area; Alabama",133860.0,M1,50133.0,66772,68115.0,1511,257,17.008604,3320818,5669,32.475758,-86.4724678,1
2,1001020300,"POLYGON ((-86.47087 32.47573, -86.47084 32.475...",1,AL,14000US01001020300,"Census Tract 203, Autauga County, Alabama",33860.0,32000US0133860,"Montgomery, AL Metro Area; Alabama",133860.0,M1,70111.0,66772,68115.0,3694,533,14.428803,5349271,9054,32.4740243,-86.4597033,0


In [140]:
final_merged_shp_dissolved['low_income'] = 0
# If the poverty rate is at least 20 percent, the tract is low income
final_merged_shp_dissolved.loc[final_merged_shp_dissolved['pov_perc'] >= 20, 'low_income'] = 1
# If the median income does not exceed 80% of the state median income, the tract is low income
final_merged_shp_dissolved.loc[final_merged_shp_dissolved['med_inc'] <= final_merged_shp_dissolved['st_med_inc']*0.8, 'low_income'] = 1
# For metropolitan tracts (LSAD 'M1') the threshold is the greater of the two 80% values, so a median income
# that does not exceed 80% of the MSA median income also makes the tract low income
final_merged_shp_dissolved.loc[(final_merged_shp_dissolved['lsad'] == 'M1') & (final_merged_shp_dissolved['med_inc'] <= final_merged_shp_dissolved['msa_med_inc']*0.8), 'low_income'] = 1
final_merged_shp_dissolved.head(3)

Unnamed: 0,tractId,geometry,stateFP,state,geo_id,tract_name,cbsa_id,geo_id_msa,msa_name,msa_tract,lsad,med_inc,st_med_inc,msa_med_inc,total_pop,pov_pop,pov_perc,area_land,area_water,lat,lon,low_income
0,1001020100,"POLYGON ((-86.51038 32.47225, -86.51031 32.472...",1,AL,14000US01001020100,"Census Tract 201, Autauga County, Alabama",33860.0,32000US0133860,"Montgomery, AL Metro Area; Alabama",133860.0,M1,70699.0,66772,68115.0,1941,265,13.652756,9825304,28435,32.4819731,-86.4915648,0
1,1001020200,"POLYGON ((-86.48127 32.47744, -86.48126 32.477...",1,AL,14000US01001020200,"Census Tract 202, Autauga County, Alabama",33860.0,32000US0133860,"Montgomery, AL Metro Area; Alabama",133860.0,M1,50133.0,66772,68115.0,1511,257,17.008604,3320818,5669,32.475758,-86.4724678,1
2,1001020300,"POLYGON ((-86.47087 32.47573, -86.47084 32.475...",1,AL,14000US01001020300,"Census Tract 203, Autauga County, Alabama",33860.0,32000US0133860,"Montgomery, AL Metro Area; Alabama",133860.0,M1,70111.0,66772,68115.0,3694,533,14.428803,5349271,9054,32.4740243,-86.4597033,0


In [141]:
final_merged_shp_dissolved['low_income'].value_counts()

0    50461
1    34934
Name: low_income, dtype: int64

In [143]:
%%time
# Save the file as low_income_tracts.shp
os.makedirs(wd + "low_income_tracts", exist_ok=True)
output_path = wd + "low_income_tracts/low_income_tracts.shp"
final_merged_shp_dissolved.to_file(output_path)



CPU times: total: 45.3 s
Wall time: 1min 40s
