In [1]:
#default_exp HRSA_tracts

In [2]:
# "There are measurement challenges with both the Census and OMB definitions. Some policy 
# experts note that the Census definition classifies quite a bit of suburban area as rural. 
# The OMB definition includes rural areas in Metropolitan counties including, for example, 
# the Grand Canyon which is located in a Metro county. Consequently, one could argue that the 
# Census Bureau standard includes an overcount of the rural population whereas the OMB standard 
# represents an undercount." 

# To get the locations that are rural by HRSA definition, add these census tracts in metro/micro 
# counties to the non-metro/micro counties in the OMB definition. (HRSA is the Health Resources 
# and Services Administration. A unit of HRSA is the Federal Office of Rural Health Policy 
# (FORHP). FORHP is the unit that defines 'rural' in a way that is different from OMB and Census.)

# The HRSA definition, in other words, is a refinement of the OMB definition. The Census 
# definition is unrelated to either, and later notebooks show that it incorporates the largest 
# number of InfoGroup firms and the largest quantity of rural employment.

# from: https://www.hrsa.gov/rural-health/about-us/definition/index.html (the first paragraph is 
# also from this source):
# "The FORHP accepts all non-Metro counties as rural and uses an additional method of determining 
# rurality called the Rural-Urban Commuting Area (RUCA) codes." RUCA codes are computed from 
# Census data. See 
# https://www.ers.usda.gov/topics/rural-economy-population.aspx and 
# https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/

In [3]:
# This is not the same universe as the ACP project. They "focus on the counties in which more than 80% of a given 
# county falls into the HRSA rural definition." The quote does not say what the 80% measures (probably area), but 
# in any case their definition applies to whole counties.

# To the extent that our analysis is based on individual businesses, we can locate observations within
# areas that are wholly rural; i.e., non-metro counties and rural census tracts in metro counties as defined by
# HRSA/FORHP. 

# At the bottom of this notebook we add flags to identify a record as 'rural' in each of the three 
# (OMB, Census, HRSA) criteria.

In [4]:
import PyPDF2
import pandas as pd
import numpy as np

In [5]:
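def combo(row):
    """Concatenate county FIPS and tract code into a full tract ID, or NaN if either is missing."""
    # The CSVs are read with dtype=object and filled with string sentinels
    # ('99999', '999999'), so compare as strings; an int comparison never matches.
    if str(row['FIPS Code']) == '99999' or str(row['Census Tract']) == '999999':
        return np.nan
    return str(row['FIPS Code']) + str(row['Census Tract'])

def rur3(row):
    """Return 1 if the record is rural under the HRSA/FORHP definition, else 0."""
    # Rural = CBSA Level 0 (taken here to mean a non-metro/micro county) or a tract
    # on the FORHP list. Uses the global rural_tracts built in a later cell.
    if int(row['CBSA Level']) == 0 or row['Full Census Tract'] in rural_tracts:
        return 1
    return 0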
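
# A row-wise apply is slow on large files, and `in` on a list scans the whole list for 
# every record. Below is a minimal vectorized sketch of the same rule, assuming the column 
# names used in this notebook. The function name flag_rural_hrsa and the sample rows are 
# hypothetical and for illustration only; they are not InfoGroup data.

```python
import numpy as np
import pandas as pd

def flag_rural_hrsa(df, rural_tracts):
    """Return a 0/1 Series: 1 where CBSA Level is 0 or the tract is in rural_tracts."""
    tract_set = set(rural_tracts)  # set membership avoids scanning a list per record
    non_metro = df['CBSA Level'].astype(int) == 0
    rural_tract = df['Full Census Tract'].isin(tract_set)  # NaN tracts evaluate to False
    return (non_metro | rural_tract).astype(int)

# Hypothetical sample records (made-up tract IDs).
sample = pd.DataFrame({
    'CBSA Level': ['0', '2', '2', '1'],
    'Full Census Tract': ['01001020100', '04005002000', '04005001500', np.nan],
})
sample_flags = flag_rural_hrsa(sample, ['04005002000'])
```

# Like rur3, this sketch assumes 'CBSA Level' has no missing values; astype(int) would 
# raise on NaN.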

In [6]:
# Source: 
# https://www.hrsa.gov/sites/default/files/hrsa/ruralhealth/resources/forhpeligibleareas.pdf
infile = '/InfoGroup/data/rurality/FORHP_eligibleareas.pdf'
# Note: PdfFileReader, numPages, getPage and extractText are the legacy PyPDF2 names.
# PyPDF2 3.0 removed them in favor of PdfReader, len(reader.pages), reader.pages[i]
# and extract_text().
# There are 48 pages in the pdf document. Tract data begins on page 19.
pages = []
# The with-block closes the file even if text extraction fails.
with open(infile, 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    npages = pdfReader.numPages
    for pg in range(npages):
        # extract the text of each page
        pages.append(pdfReader.getPage(pg).extractText())

In [7]:
# hide
with open('/InfoGroup/data/rurality/tract_data.txt','w') as fout:
    for pg in range(19,48):
        fout.write(pages[pg])

In [8]:
# Extract just the tract IDs into a list. First 5 digits of Tract IDs are the same as 
# the State/County FIPS code.
rural_tracts = []

with open('/InfoGroup/data/rurality/tract_data.txt','r') as fin:
    for line in fin:
        if line[0] != chr(32):
            continue
        else:
            line = line.strip()
            try:
                if line[0].isnumeric(): 
                    rural_tracts.append(line)
            except IndexError:
                pass
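
# The loop above keeps only lines that begin with a space and whose stripped text starts 
# with a digit. Below is the same rule as a small function that can be checked on a few 
# lines. The function name extract_tract_ids and the sample lines are hypothetical; the 
# sample text is not copied from the HRSA PDF.

```python
def extract_tract_ids(lines):
    """Keep lines that begin with a space and whose stripped text starts with a digit."""
    tracts = []
    for line in lines:
        if not line.startswith(' '):
            continue
        stripped = line.strip()
        # An all-blank line strips to '', so test for content before indexing.
        if stripped and stripped[0].isnumeric():
            tracts.append(stripped)
    return tracts

# Hypothetical lines in the shape the filter expects.
sample_lines = [
    'Hypothetical County header\n',
    ' 04005002000\n',
    ' \n',
    ' Tract list continues\n',
    '04005001500\n',
]
sample_ids = extract_tract_ids(sample_lines)
```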

In [9]:
# hide
with open('/InfoGroup/data/rurality/rural_census_tracts.lis','w') as fout:
    for t in rural_tracts:
        fout.write(t+'\n')

In [10]:
logfile = open('005-HRSA.log','w')

In [11]:
for yr in range(1997,2018):
    print(f'{yr}:',file=logfile)
    infile = f'/InfoGroup/data/rurality/InfoGroup_{yr}_nb04.csv'
    df = pd.read_csv(infile,dtype=object) 
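    # Assign the result back; fillna(inplace=True) on a selected column may act on a copy.
    df['Census Tract'] = df['Census Tract'].fillna('999999')
    df['FIPS Code'] = df['FIPS Code'].fillna('99999')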
    df['Full Census Tract'] = df.apply(combo,axis=1)
    df['rural_HRSA'] = df.apply(rur3,axis=1)
    print(df['rural_HRSA'].value_counts(),file=logfile)
    print(df['rural_HRSA'].value_counts(normalize=True) * 100,file=logfile)
    # write a new file
    outfile = f'/InfoGroup/data/rurality/InfoGroup_{yr}_nb05.csv'
    df.to_csv(outfile,index=None)

In [12]:
logfile.close()

In [13]:
# hide
# InfoGroup for 2017 has data on 82,385 census tracts for all (above). The HRSA/FORHP file 
# of rural units lists 2,302 rural census tracts in addition to all those in non-Metro 
# counties. The Census's Zip Code-to-Census Tracts relationship file identifies 74,091 
# census tracts in all states (below).
# The ID of a census tract is the combination of the state FIPS, county FIPS, and tract number.
# infile = '/home/tflory/Relationship_Files/Census_Tract_to_PUMA.csv'
# ct_df = pd.read_csv(infile,usecols=['STATEFP','COUNTYFP','TRACTCE']).drop_duplicates()
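
# A sketch of building the tract ID described above from the three columns named in the 
# commented-out read. It assumes the standard Census widths (2-digit state, 3-digit county, 
# 6-digit tract) and zero-pads each part in case leading zeros were lost on import. The 
# helper name full_tract_id and the sample values are hypothetical.

```python
import pandas as pd

def full_tract_id(df):
    """Build an 11-character tract ID from state, county and tract columns."""
    state = df['STATEFP'].astype(str).str.zfill(2)
    county = df['COUNTYFP'].astype(str).str.zfill(3)
    tract = df['TRACTCE'].astype(str).str.zfill(6)
    return state + county + tract

# Hypothetical values, stored as integers to show the padding.
sample_ct = pd.DataFrame({
    'STATEFP': [1, 4],
    'COUNTYFP': [1, 5],
    'TRACTCE': [20100, 2000],
})
sample_geoids = full_tract_id(sample_ct)
```

# Reading the file with dtype=object (as done for the InfoGroup CSVs) keeps the leading 
# zeros and makes the padding a no-op.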